nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0
159 stars 49 forks source link

Making a core SNP tree to go along with final.tre #412

Open djbradshaw2 opened 1 month ago

djbradshaw2 commented 1 month ago

Dear Gubbins Creators,

Thanks again for a great tool! Sorry to open up so many questions. This is a multiple tools question, so please let me know if anything is out of gubbins scope.

It is my understanding that the .final_tree.tre is made using a recombination masked version of the .filtered_polymorphic_sites.phylip file per #292 and #380. Gubbins Question: Is that correct?

I wanted to create a core SNP version of the phylogenetic tree using RaxML and IQtree to compare with the full non-recombinant SNPs *.fina_tree.tre which was created with RaxML (From gubbins.log - Tree constructor (later iterations),raxmlHPC-PTHREADS-AVX2,8.2.12,GTRGAMMA).

Multiple Tools Question: To do this should I do the following? I was ultimately unsure where to get the constant site counts from.

  1. Extract the core SNPs alignment (per Snippy GitHub) snp-sites -c gubbins.filtered_polymorphic_sites.fasta > SX519_Chromosomal_Ref_clean.core.aln

  2. Recreate the masked version of the alignment used to make *.final_tree.tre Gubbins Question: Gubbins removed two isolates from original snippy input due to missing data percentages > 25% increasing the number of SNPs, will using this gff overcome that despite using the original snippy input alignment?

mask_gubbins_aln.py --aln SX519_Chromosomal_Ref_clean.core.full.aln --gff gubbins.recombination_predictions.gff --out gubbins.masked.aln

  1. Get the constant site counts to be used to make the new core trees snp-sites -C gubbins.masked.aln 1123980,1204859,1205431,1129769

  2. Make the core SNP trees with output core alignment from step 1 and constant sites numbers from step 3. iqtree -fconst 1123980,1204859,1205431,1129769 -s SX519_Chromosomal_Ref_clean.core.aln raxml - script to be determined but using https://github.com/tseemann/snippy/issues/362 as guidance

Thanks for your time and help. Please let me know if you need any other information or have any questions.

Sincerely,

David

gubbins version: 3.3.0 raxml version: 8.2.12 iqtree version: 2.2.3 snippy version: 4.6.0 snp-sites version: 2.5.1

Gubbins Output Creation Scripts: snippy-clean_full_aln core.full.aln > SX519_Chromosomal_Ref_clean.core.full.aln

run_gubbins.py --first-tree-builder rapidnj --first-model JC -p gubbins -c 32 -v SX519_Chromosomal_Ref_clean.core.full.aln

nickjcroucher commented 1 month ago

Thanks! I would recommend using --no-cleanup to keep the alignment files at each iteration, and inserting a print statement at this line to output the tree_building_command variable for each iteration - this should allow you to reproduce each step.

djbradshaw2 commented 1 month ago

Hi Dr. Croucher,

Thanks for helping me with this as well. Thankfully I am running gubbins on slurm and keeping the stdout, so I have the command (I think) for the fifth iteration which is the only one I would like to reconstruct as a core SNP version. From that I have the following commands (edited to remove paths before file name for brevity).

Iteration 5

Constructing the phylogenetic tree with raxmlHPC-PTHREADS-AVX2... raxmlHPC-PTHREADS-AVX2 -T 32 -p 3695 -safe -m GTRGAMMA -f d -p 1 -s SX519_Chromosomal_Ref_clean.core.full.iteration_4.tre.phylip -n SX519_Chromosomal_Ref_clean.core.full.iteration_5 -t SX519_Chromosomal_Ref_clean.core.full.iteration_4.tre ...done. Run time: 865947.10 s

Reconstructing ancestral sequences with pyjar...

Fitting substitution model to tree... raxmlHPC-PTHREADS-AVX2 -T 32 -p 9260 -safe -m GTRGAMMA -s SX519_Chromosomal_Ref_clean.core.full.aln.snp_sites.aln -n SX519_Chromosomal_Ref_clean.core.full.iteration_5_reconstruction -t /tmp38ovm7mx/SX519_Chromosomal_Ref_clean.core.full.iteration_5.tre.rooted -f e -w /tmp38ovm7mx

Running joint ancestral reconstruction with pyjar Reading tree file: /tmp38ovm7mx/RAxML_result.SX519_Chromosomal_Ref_clean.core.full.iteration_5_reconstruction Reading info file: /tmp38ovm7mx/RAxML_info.SX519_Chromosomal_Ref_clean.core.full.iteration_5_reconstruction Frequencies: 0.187771, 0.314138, 0.309745, 0.188346 Rates: 1.101601, 4.490749, 1.306992, 0.26046, 4.211125, 1.0 Reconstructing sites on tree Printing alignment with internal node sequences: /tmp38ovm7mx/SX519_Chromosomal_Ref_clean.core.full.iteration_5.internal.joint.aln Printing tree with internal nodes labelled: /tmp38ovm7mx/SX519_Chromosomal_Ref_clean.core.full.iteration_5.internal.joint.tre Done Time for JAR preparation: 44.34410039099748 Time for JAR calculation: 23831.071987628995

Transferring pyjar results onto original recombination-corrected tree

Done transfer ...done. Run time: 907038.52 s

Running Gubbins to detect recombinations... gubbins -r -v SX519_Chromosomal_Ref_clean.core.full.aln.gaps.vcf -a 100 -b 10000 -f /tmp38ovm7mx/SX519_Chromosomal_Ref_clean.core.full.aln -t SX519_Chromosomal_Ref_clean.core.full.iteration_5.tre -m 3 -p 0.05 -i 1.0 /tmp38ovm7mx/SX519_Chromosomal_Ref_clean.core.full.iteration_5.internal.joint.aln ...done. Run time: 909932.03 s

Checking for convergence... ...done. Run time: 909935.27 s Maximum number of iterations (5) reached.

Exiting the main loop.

I am assuming that the file whose name is changed to .final_tree.tre is created from the first command? Does that mean that the .filtered_polymorphic_sites.phylip is the SX519_Chromosomal_Ref_clean.core.full.iteration_4.tre.phylip file used in that command? Also, am I correct in that the pyjar command after that seems to be creating the file that eventually becomes the *.node_labelled.final_tree.tre final output?

Sorry if I am grasping at straws here, I would honestly like to try to recreate the final tree as a core SNP version without having to rerun command with --no-cleanup but understand if that is the only way to do it.

Thanks for your time and help.

Sincerely,

David

nickjcroucher commented 1 month ago

Yes, this is the command that generates the final tree (using the recombination-filtered alignment from the previous iteration): raxmlHPC-PTHREADS-AVX2 -T 32 -p 3695 -safe -m GTRGAMMA -f d -p 1 -s SX519_Chromosomal_Ref_clean.core.full.iteration_4.tre.phylip -n SX519_Chromosomal_Ref_clean.core.full.iteration_5 -t SX519_Chromosomal_Ref_clean.core.full.iteration_4.tre

The pyjar reconstruction is just for inferring recombinations, not for tree construction.