tax_glom is taking too long

tizioto commented 5 years ago

Hello, I have a phyloseq object (https://drive.google.com/file/d/1U-YdB5v3oEjAyUn4V5EXi_L8RJJI6e5n/view?usp=sharing) and am trying to run tax_glom on it, but it is taking too long: it has been running for more than 3 days. Do you have any idea why this is happening? Thank you. Best regards

benjjneb commented 5 years ago

The speedyseq package will help with this; it has a much faster implementation of tax_glom: https://github.com/mikemc/speedyseq

@mikemc can tell you more.

mikemc commented 5 years ago

phyloseq::tax_glom() gets much slower as the number of taxa increases. In your case, the number of taxa is extremely large (66781 taxa), which is why it is taking so long. But as @benjjneb said, I released an add-on package with a much faster implementation of tax_glom(), with instructions for use at https://github.com/mikemc/speedyseq. I had never tried it with such a large number of taxa (or samples) before, but it seems to work pretty well: a genus-level tax_glom takes ~5 seconds on my laptop.

library(speedyseq)

ps <- readRDS("ps2.all.rds")
ps
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 66781 taxa and 1420 samples ]
#> sample_data() Sample Data:       [ 1420 samples by 20 sample variables ]
#> tax_table()   Taxonomy Table:    [ 66781 taxa by 8 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 66781 tips and 66050 internal nodes ]
system.time(ps1 <- tax_glom(ps, "genus"))
#>    user  system elapsed 
#>   4.903   0.492   5.409

The current version of speedyseq (v0.1.0) is archived on Zenodo (https://zenodo.org/badge/latestdoi/179732395), making it citable and suitable for use in reproducible workflows.
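
For intuition about why a vectorized approach can be fast even with tens of thousands of taxa: agglomeration is essentially summing the rows of the OTU table that share a taxonomic label. The base-R sketch below is only an illustration under that assumption, not the actual phyloseq or speedyseq code. The toy counts and genus labels are made up, and a real tax_glom() compares the full taxonomy down to the chosen rank rather than a single label.

```r
# Toy OTU table: 6 taxa (rows) by 3 samples (columns); counts are made up.
otu <- matrix(
  c(10, 0, 3,
     5, 2, 0,
     0, 7, 1,
     4, 4, 4,
     1, 0, 9,
     2, 2, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:6), paste0("S", 1:3))
)

# Hypothetical genus assignment for each taxon; NA marks an unclassified taxon.
genus <- c("Bacteroides", "Bacteroides", "Prevotella",
           "Prevotella", NA, "Blautia")

# Drop taxa without a genus label (tax_glom() does the same by default, NArm = TRUE).
keep <- !is.na(genus)

# rowsum() adds up all rows that share a group label in one vectorized call,
# instead of merging taxa one group at a time in an R-level loop.
otu_genus <- rowsum(otu[keep, , drop = FALSE], group = genus[keep])
otu_genus
```

The result has one row per genus and the original sample columns. The cost grows roughly with the size of the table, which is the kind of scaling that makes a table with 66781 taxa tractable.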

tizioto commented 5 years ago

Dear all, This worked perfectly. Thank you. Best Regards

Dra. Polyana Tizioto NGS Soluções Genômicas



joey711 / phyloseq

tax_glom is taking too long #1245