spholmes / F1000_workflow

43 stars 33 forks source link

do we expect step of constructing phen.tree take too long? #17

Open DuyDN opened 7 years ago

DuyDN commented 7 years ago

Hi, I have a question that how long does it take to construct the phen tree. I mean that run the optim.pml command took me more than 3 days and not done so far

fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE, rearrangement = "stochastic", control = pml.control(trace = 0))

I found some trick on internet but not sure if it can be apply with the dada2 workflow which help me to complete the construct tree

fitGTR <- optim.pml(fitGTR, rearrangement = "stochastic", ratchet.par = list(iter = 5L, maxit = 5L, prop = 1/3))

data can be download here and the code i run below

(https://drive.google.com/file/d/0B5ombffhUkLFNGQwLWpBblJpdGM/view?usp=sharing)

Remove chimeras

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE) dim(seqtab.nochim) sum(seqtab.nochim)/sum(seqtab) seqs <- getSequences(seqtab.nochim) names(seqs) <- seqs # This propagates to the tip labels of the tree alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA) save(seqs, ps,taxa, file = "dada.phylo.Rdata") length(seqtab.nochim) dim(seqtab.nochim) library(phangorn) phang.align <- phyDat(as(alignment, "matrix"), type="DNA") dim(alignment) dm <- dist.ml(phang.align) compact.align <- unique(phang.align) length(compact.align) # ~5000 length(phang.align) treeNJ <- NJ(dm) # Note, tip order != sequence order fit = pml(treeNJ, data=phang.align)

negative edges length changed to 0!

fitGTR <- update(fit, k=4, inv=0.2)

this one did not complete

fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE, rearrangement = "stochastic", control = pml.control(trace = 0))

this one complete on a day

fitGTR <- optim.pml(fitGTR, rearrangement = "stochastic", ratchet.par = list(iter = 5L, maxit = 5L, prop = 1/3))

benjjneb commented 7 years ago

Hi @DuyDN

Thanks for moving this issue over here. Yes, we are aware that the GTR fitting in phangorn does not scale well to datasets with thousands (or more) different sequences. The recommended alternative right now is to use RaxML via the ips package within R.

This was discussed in an earlier issue (https://github.com/benjjneb/dada2/issues/88), in particular see the very helpful comments from @giriarteS

Also for my co-authors @spholmes @krisrs1128 @jfukuyama @joey711 -- this is something that we should consider updating if there is a new version of the F1000 workflow, as I've gotten several emails/opened issues in the dada2 repository from people hitting this same roadblock.