nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov
MIT License
1.35k stars 403 forks source link

TREEtime computational bottleneck #410

Open sebdart opened 4 years ago

sebdart commented 4 years ago

Context
TREETime is a computational bottleneck in the Augur bioinformatic pipeline, especially when modeling thousands and thousands of raw genomic data.

Bottom line up, how to have Augur able to compute ten of thousands of new raw genomic SARS-CoV-2 sequences faster.

Description
As GISAID adds hundred of new genomic samples of SARS-CoV-2 a day, I noticed that the computational bottleneck comes in the Refine step with TreeTime. TreeTime does not seem to be Shared Memory Processing compliant (ie., not parallelized) and takes as much time as IQ-Tree in itself. By the time I write this note (mid-May 2020), we have nearly 16,000 SARS-CoV-2 genomic sequences from GISAID; by June-2020, we will be beyond 20,000 of FASTA sequences.

Is there any options to speed Augur/this up? Have we considered any other alternative to TREETime?, such as the Fast Dating using Least-Squares Criteria and Algorithms (which seems to run an order of magnitude faster than TREETime): http://www.atgc-montpellier.fr/LSD/

Has anyone considered to add LSD as an alternative to TREETime in Augur and nCOV bioinformatic pipeline? Has anyone considered to speedup TREETime, such as SMP parallelizing it? Has anyone considered any other alternative to estimate SAR-CoV-2 clock phylogeny in Augur?

Possible solution
Add LSD in Augur Pipeline in Refine step as a faster alternative to TREETime? Parallelizing TreeTime? Any other models for molecular clock phylogeny in Augur?

jameshadfield commented 4 years ago

You're broadly correct that TreeTime is the bottleneck currently and there are two avenues we have discussed: (1) There's discussion in the treetime repo about various ways to speed this up and there are a number of work-in-progress versions there. (2) A version of IQ-TREE + LSD was recently released. We are keen to include this option within augur but haven't discussed exactly how to do it as it combines what is currently two discrete steps: augur tree & augur refine.

sebdart commented 4 years ago

If I could suggest to keep the Refine-step as it is because the Augur's Tree-step already takes a few hours on multi-CPUs; at the end of the Tree-step, we definitely want to have a tree_raw.awk file saved in the results/ directory ... just in case we need to restart the simulation with snakemake after the Tree-step (eg., computer crash, lost connection with the Cloud, Cloud going down because of maintenance, etc.)

I was not aware of this new version of IQ-Tree which was released just a few days ago (v 2.0.5, May 15, 2020) with Phylogenetic Dating from a LSD2 model (Fast Dating using Least-Squares Criteria and Algorithms): http://www.iqtree.org/doc/Dating#dating-an-existing-tree

If one wants to keep consistency with the current ncov Augur's TreeTime Refine-step (knowing that LSD and TreeTime are completely different models and acknowledging that TreeTime is a very top notch sophisticated model):

Augur Refine with TreeTime --tree results/tree_raw.awk --alignment results/masked.fasta --metadata results/metadata_adjust.tsv --root Wuhan-Hu-1/2019 Wuhan/WHO1/2019 --timetree --coalescent skyline --date-inference marginal --date-confidence --no-covariance --clockfilter-iqd 4

then, with LSD2 in IQ-TREE, similar Augur Refine options could be roughly "imitated" by:

iqtree -s results/masked.fasta --date results/metadata_adjust.tsv -te results/tree_raw.awk --date-ci 100 -m GTR -o "Wuhan-Hu-1/2019 Wuhan/WHO1/2019" --date-options "-w rate_file_with_clock_rate=0.0008"

The SARS-CoV-2 clock rate (0.0008) that is currently used in ncov must now be provided in a separate file "rate_file_with_clock_rate" according to LSD model ... or, one would just let the LSD model estimate itself (as I do with TreeTime) and not worry of this fixed clock rate (I did both, the tree topology is the same! However, the mutation rate/site/year is different by a factor 2 on 12,5000 sequences). The option setup above would also imitates the --no-covariance/covariation=False option of Augur/TreeTime. I do not think LSD is capable of running on multi-CPUs (but I could be wrong if IQ-Tree did some changes to LSD original model; then perhaps setting the option -nt?).

There will be an issue with the metadata_adjust.tsv file, which requires for IQ-Tree LSD2 model a much simpler format for the sample ID and its date, ie., Wuhan-Hu-1|2019-12-26 or Wuhan-Hu-1 2019-12-26

but not the current metadata_adjust.tsv: Wuhan-Hu-1/2019 ncov EPI_ISL_402125 MN908947 2019-12-26 Asia [...]

Giving the option for the Augur Refine step to include IQ-Tree's LSD2 Refinement seems to be a worthwhile endeavor ... TreeTime is a great model but, boy, it is slow on 12,500 sequences (and that number is increasing non stop).

Glad to have this conversation and, at least, have some hopes something could be done regarding this clock bottleneck.

Thank you!

SEbastien

rneher commented 4 years ago

I'd be curious how the LSD output looks like. With 12k sequences, the bottleneck of treetime is likely the polytomy resolution. This could be switched off (LSD doesn't do it anyway I think), but it would result in odd fork-like trees.

sebdart commented 4 years ago

It will be on my to-do list to [try to] implement and further test this with IQ-Tree v2 and its own implemented LSD time phylogeny ... the only thing is I am not sure when I ll find time ...

Also, thank you for making me aware of the polytomy resolution issue (this has escaped me), I am wondering (not that I know anything) whether we could use one of IQ-TREE option by default during the Augur's TREE-step:

-czb [Collapse near zero branches, so that the final tree may be multifurcating. This is useful for bootstrapping in the presence of polytomy to reduce bootstrap supports of short branches.]

and then use Phylogenetic Dating from LSD2 ... I noticed that NEXTStrain's Augur does not seem to take advantage of that option and rather let TreeTime to do this by default with its NOT(--keep-polytomies) ... I would have thought that this polytomy resolution should have be done before Phylogenetic Dating ...

Bottom line, is that there may be multiple routes to try but it may impact the overall SARS-COV-2 tree (at least in time, hopefully not too much in terms of its topology). TreeTime is a great tool indeed, just too slow (not that I am much impressed by IQ-TREE computational speed in itself).

Thank you for your comment; this is helpful to make me understand better what s going on in this pipeline ...

Sebastien

trvrb commented 3 years ago

I do think we could eventually aim to have --method lsd and --method treetime as options to augur refine. Or to have other recommended flags to speed up TreeTime.

sebdart commented 3 years ago

Thank you very much, I'd love to see/test that ... I aim to use LSD on the GISAID's audacity tree for large scale scale testing (when I ll start to have more time to focus on things I want to do). Also I meant to say this before but I found out that the option -czb in IQ-Tree (getting rid of near zero branches) make treetime work much much faster (on a large simulation with 30,0000 sequences, treetime performs twice to three times faster then--depending on the setup). However, the unfortunate caveat is that -czb is not parallelized in IQ-Tree ... so now, IQ-Tree is the slow one, but it saves time for treetime ... it is hard to say which is which better ... perhaps, replacing IQ-Tree with raxmlHPC and use LSD as well ...

babarlelephant commented 3 years ago

@sebdart Hi, I am stuck at a much smaller number of sequences than you do, but I found there that iqtree2 has a very fast algorithm to add many sequences to an almost optimal subsampled tree.

I made a nexstrain tree following the standard pipeline with 419 French sequences then iqtree2 -seed 1729 -s france.fasta -g france_subsampled.nwk -n 0 -m JC -fixbr -nt 1 --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -pre iqtree_seqsadded_mp.nwk added in one minute the 4500 other sequences to the newick file, I ran augur refine --timetree then ancestral, translate, export which took maybe 40 min and got a tree with the whole 4600 French sequences which doesn't look too bad. I will try -czb, -keep-polytomies and so on tomorrow. I tried to replace augur refine --timetree by iqtree/LSD but it is taking a while when trying to create the substitution model, no idea if I can create one from the subsampled dataset.

sebdart commented 3 years ago

Bravo pour les 419 séquences de la France! J'apprecie vos conseils, cela va m'aider énormément ... Actually, it was my plan to follow that pathway to add to a prebuilt IQTREE tree new set of sequences ... the reason I have not pursued yet this pathway is because of visualization of such massive tree (>>> 50,000 sequences in keeping Metadata capabilities in a manner of NextStrain) ... I have ideas to break this data challenge but I dont have the manpower/time to address this visualization challenge at this time ... although, this may change in the future ... Continuer le travail pour la France. A bientôt. Merci. Sebastien.

fanninpm commented 2 years ago

Augur v18.0.0 was released not too long ago, and it adds the --use-fft flag to augur refine. Using this flag cut down augur refine's computational time considerably (from 57 minutes down to 35 minutes, on a rather large tree of ~3800 genomes).

I used the following patch to patch workflow/snakemake_rules/main_workflow.smk, which you should feel free to modify:

Patch to workflow/snakemake_rules/main_workflow.smk ```diff --- a/workflow/snakemake_rules/main_workflow.smk 2022-10-03 20:11:11.081133161 +0000 +++ b/workflow/snakemake_rules/main_workflow.smk 2022-10-03 20:12:06.449659266 +0000 @@ -843,7 +843,8 @@ divergence_unit = config["refine"]["divergence_unit"], clock_filter_iqd=f"--clock-filter-iqd {config['refine']['clock_filter_iqd']}" if config["refine"].get("clock_filter_iqd") else "", keep_polytomies = "--keep-polytomies" if config["refine"].get("keep_polytomies", False) else "", - timetree = "" if config["refine"].get("no_timetree", False) else "--timetree" + timetree = "" if config["refine"].get("no_timetree", False) else "--timetree", + use_fft = "" if config["refine"].get("no_timetree", False) or config["refine"]["date_inference"] != "marginal" else "--use-fft" conda: config["conda_environment"] shell: """ @@ -855,6 +856,7 @@ --output-node-data {output.node_data} \ --root {params.root} \ {params.timetree} \ + {params.use_fft} \ {params.keep_polytomies} \ --clock-rate {params.clock_rate} \ --clock-std-dev {params.clock_std_dev} \ ```

Does this sound good?

victorlin commented 2 years ago

@fanninpm that's great to hear! I'm not very involved in the TreeTime development, but I noticed that --use-fft was added to the Nextstrain monkeypox workflow then recently removed. @corneliusroemer is this because of https://github.com/nextstrain/augur/issues/1057? Do you think it'd be beneficial to add to the ncov workflow?

corneliusroemer commented 2 years ago

Yes @victorlin it would be good to add --use-fft to ncov in the mid-term, once we're sure that it's no longer buggy.

Maybe we can make an issue to add --use-fft to all of our computationally demanding workflows in the mid-term?