Continue crashed analysis from tree inference step

diegomarquezp commented 3 years ago

Hello Siavash. Hoping you are well. I'm reaching the final steps to build a tree from the Silva 13.8 dataset. Unfortunately, it crashed during the tree inference step. The only iteration for the realignment step took a bit more than 3 weeks. I was checking the wiki for a way to continue with the alignment produced with this step, but apparently the --aligned option still goes over a realignment step. I did some time estimations with subsets of the Silva database and it should take about 3 more days before finishing the tree, only if we manage to skip the realignment, otherwise it would be 3 more weeks again.

I was wondering if I'm missing an option from the wiki to continue from this substep of the iteration. Otherwise, I can try to modify the code to provide the last alignment to the first iteration. If that's the case, I will need to kindly ask you to refer me to the involved files in this change or any development documentation to aid in solving this situation.

Update: I found out that the inference step consists of a call to fasttreeMP - the debug output shows the exact args to execute the binary with. I'm thinking that the final steps would involve running a modified version of treeholder.py

Thanks beforehand for your help.

smirarab commented 3 years ago

To be clear, you ran PASTA using the alignment from the previous stage as input (-i)? And it still tried to do an alignment?

Also, are you planning to do one iteration or more? If you want to have only one iteration, there is no reason to run FastTree inside PASTA. You can just run it outside.

diegomarquezp commented 3 years ago

Last month I started PASTA with -i pastajob_temp_iteration_initialsearch_seq_alignment.txt --aligned. The last file produced in the folder as of today was pastajob_temp_iteration_0_seq_alignment.txt.

So is it possible to just obtain the final tree with fasttreeMP from ...iteration_0_seq_alignment.txt ?

From the subset tests logs, I'm guessing the command below would be useful for this last step?:

/home/ec2-user/pasta-code/pasta/bin/fasttreeMP -quiet -nt -gtr -gamma -fastest -intree /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/start.tre -log /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/log /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/input.fasta

(assuming input.fasta == ...iteration_0_seq_alignment.txt)

edit: Yes, it did perform a realignment step (one iteration)

Thank you!

smirarab commented 3 years ago

Hi Diego,

Yes, you can just obtain the final tree with fasttreeMP from ...iteration_0_seq_alignment.txt.

Note that PASTA does 3 iterations by default, but most of the advantage comes from the first iteration. Running FastTree on this alignment will give you the result of the first iteration. Given the size of your dataset, I think one iteration is reasonable and sufficient. If you want to do one more iteration, you can, but I don't think it is necessary.

Three more caveats.

1. The ...iteration_0_seq_alignment.txt file is already masked to remove super gappy sites. The unmasked file is also available (...temp_iteration_0_seq_unmasked_alignment.gz) but you don't want to give that to FastTree. Also, it is in a format that needs translation (more on that below). However, you may want to see how long ...iteration_0_seq_alignment.txt is and you may decide to mask even a bit more; the default is to mask a site if it is a gap in >99.9% of species.
2. The ...iteration_0_seq_alignment.txt file uses PASTA's internal names for species. These names can be translated back to the original names using a simple text file (._temp_name_translation.txt) and a command that I will send you.
3. If you are going to use the FastTree tree as your final tree, you may want to eliminate the starting tree (/home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/start.tre). It will take a bit longer, but that should be fine.
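
The last caveat amounts to dropping -intree from the logged command. Here is a minimal Python sketch of that, using only the flags that appear in the debug output quoted earlier in the thread. The helper names (build_fasttree_cmd, run_fasttree) and the file names in the usage lines are hypothetical, and the sketch relies on FastTree printing the tree to standard output.

```python
import subprocess


def build_fasttree_cmd(binary, alignment, log_path, start_tree=None):
    """Assemble the FastTree argument list seen in the PASTA debug output.

    Only flags copied from that log are used. Leaving start_tree as None
    omits -intree, so FastTree builds its own starting tree.
    """
    cmd = [binary, "-quiet", "-nt", "-gtr", "-gamma", "-fastest"]
    if start_tree is not None:
        cmd += ["-intree", start_tree]
    cmd += ["-log", log_path, alignment]
    return cmd


def run_fasttree(cmd, tree_path):
    """Run FastTree and save the Newick tree it prints on stdout."""
    with open(tree_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical file names; substitute your own binary and alignment.
    cmd = build_fasttree_cmd(
        "fasttreeMP",
        "pastajob_temp_iteration_0_seq_alignment.txt",
        "fasttree.log",
    )
    print(" ".join(cmd))
```

Passing start_tree reproduces the logged command; leaving it out gives the slower run without a starting tree described in caveat 3.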

For both 1 and 2, I have scripts that are shipped as part of PASTA. Let me write a quick markdown file and describe these. In the meantime, you can start your FastTree run. I hope to get to this in a day or so.
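
Until the shipped scripts are documented, here is a rough Python sketch of what the name translation does. It is not PASTA's own code. It ASSUMES the translation file stores, for each species, the internal name on one line, the original name on the next, and then a blank separator line; please check your own file before relying on that layout. The function names are hypothetical.

```python
def load_name_map(lines):
    """Parse a translation listing into {internal_name: original_name}.

    ASSUMPTION: each record is the internal name, then the original name,
    then a blank separator line. Verify this against your own file.
    """
    names = [line.strip() for line in lines if line.strip()]
    if len(names) % 2 != 0:
        raise ValueError("expected internal/original name pairs")
    return dict(zip(names[0::2], names[1::2]))


def rename_fasta(lines, name_map):
    """Yield FASTA lines with header names swapped for original names."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            internal = line[1:].strip()
            # Names missing from the map are left unchanged.
            yield ">" + name_map.get(internal, internal)
        else:
            yield line


if __name__ == "__main__":
    # Toy data standing in for the translation file and the alignment.
    translation = ["s1\n", "Original name one\n", "\n",
                   "s2\n", "Original name two\n", "\n"]
    fasta = [">s1\n", "AC-T\n", ">s2\n", "A--T\n"]
    for out_line in rename_fasta(fasta, load_name_map(translation)):
        print(out_line)
```

A tree inferred from this alignment carries the same internal names, so its leaf labels need the same mapping before the tree is used elsewhere.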

Thanks Siavash

diegomarquezp commented 3 years ago

Thanks so much Siavash. I'll let you know about how fasttree goes.

smirarab commented 3 years ago

Diego,

I added information about getting the unmasked alignment from the PASTA temporary files and name mapping here:

https://github.com/smirarab/pasta/blob/master/pasta-doc/pasta-tutorial.md#step-6-using-run_seqtoolspy

and in particular

https://github.com/smirarab/pasta/blob/master/pasta-doc/pasta-tutorial.md#restart-pasta-from-the-previous-runs

diegomarquezp commented 3 years ago

Hi Siavash.

Thanks for the added steps on the tutorial. I was able to restart the crashed step using the first and only iteration's alignment, and finally obtained a tree this week. With the tree and aligned sequences, I tried to run SEPP through QIIME after importing the PASTA results, but it took far too long (15+ hours) compared with the SEPP reference database that you published for SILVA 12.8 (40 minutes). I noticed that the aligned sequences contained in the 12.8 QZA file are only a subset of the whole 12.8 reference database. I was wondering if you used any special criteria to extract the subset. Would restricting the aligned sequence set to 2-3 sequences per species do the job? That would roughly match the number of sequences in the 12.8 reference, and it would be the only step needed, as we already have the alignment.

Thank you very much.

smirarab commented 3 years ago

Hi Diego,

There are two potential reasons.

1. Default output of PASTA is not masked for super gappy sites. There are many sites that have just a couple of letters in them among millions of species. We need to remove those before using them as input to SEPP. For removing gappy sites, I suggest you use the run_seqtools.py method that you learned about in the tutorial. I would remove sites with 99.9% gaps or 99% gaps. You can try different thresholds and see how many sites are left in the final alignment. You should hopefully have something in the same order as 12.8 (thousands of sites).
2. Once (1) is taken care of, if the running time is still high, we can think about removing sequences that are too similar to each other. For doing that, I would suggest 99% similarity or something like that. You can also use our tool TreeCluster (https://github.com/niemasd/TreeCluster) to find the optimal subset given the tree you already have.
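
To make the masking in (1) concrete, here is a small Python sketch of dropping sites by gap fraction and counting how many sites survive at each threshold. It is an in-memory illustration only, not the run_seqtools.py implementation, and it would be too slow for an alignment of this size. The function names are hypothetical, and "-" is assumed to be the only gap character.

```python
def gap_fractions(seqs):
    """Return the fraction of gap characters in every alignment column."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must all have the same aligned length")
    counts = [0] * length
    for s in seqs:
        for i, ch in enumerate(s):
            if ch == "-":  # ASSUMPTION: '-' is the only gap symbol
                counts[i] += 1
    return [c / len(seqs) for c in counts]


def mask_gappy_sites(seqs, max_gap_fraction):
    """Drop columns whose gap fraction is above max_gap_fraction."""
    fractions = gap_fractions(seqs)
    keep = [i for i, f in enumerate(fractions) if f <= max_gap_fraction]
    return ["".join(s[i] for i in keep) for s in seqs]


if __name__ == "__main__":
    # Toy alignment; real thresholds would be 0.999 or 0.99 as suggested.
    toy = ["A--C", "A-GC", "A--C", "AT-C"]
    for threshold in (0.75, 0.5):
        masked = mask_gappy_sites(toy, threshold)
        print(threshold, len(masked[0]), masked)
```

Comparing the number of sites left at 99.9% and 99% is the check described in (1): the stricter threshold should leave something on the order of thousands of sites.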

Thanks

diegomarquezp commented 3 years ago

Hi Siavash, thanks for the response. I will let you know about this

smirarab / pasta

Continue crashed analysis from tree inference step #61