nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0
174 stars 51 forks source link

Gubbins tree scale? #272

Closed cizydorczyk closed 1 year ago

cizydorczyk commented 4 years ago

Hello,

I'm having some trouble understanding the tree scale that is output in the newick formatted tree by Gubbins. Specifically, branch lengths in this tree are on the order of 10s for me, yet in trees produced by other software (iq-tree, etc.), branch lengths are always on the order of decimal numbers.

Are the branch lengths reported by Gubbins in # of SNPs? What exactly is being output here? Ultimately I want to compare trees between Gubbins and other software, but this is difficult if the branch lengths are reported in different units.

Any clarification is much appreciated.

Thank you, Conrad

mearls999 commented 4 years ago

Hi Conrad,

I am wondering the exact same thing. Did you ever find out the answer?

Thanks,

Megan

cizydorczyk commented 4 years ago

Hi Megan,

No, I never was able to figure this out.

I use a workaround nowadays, where if I run Gubbins (instead of ClonalFrameML, which I prefer as it doesn't produce such a weird output tree), I use the script found here (https://github.com/kwongj/maskrc-svg) to mask recombinant regions from my original alignment, and re-build the tree myself using IQ-Tree.

The tree from IQ-Tree is then the one I use for any downstream analyses.

Hope that helps, Conrad

tseemann commented 4 years ago

RE: Maximum-Likelihood trees from Iqtree, raxml, fasttree etc The scale is not SNPs, it is substitutions per site. These are (subtly) different. ie. A->T->A is two subs but not a SNP

The inflated scale is probably due to only giving iqtree the variant sites, not the whole genome: https://bitsandbugs.org/2019/11/06/two-easy-ways-to-run-iq-tree-with-the-correct-number-of-constant-sites/

cizydorczyk commented 4 years ago

I have had this problem when feeding Gubbins/IQ-Tree pseudo-whole genome alignments from Snippy; I'm not sure what's going on but I also have not given it as much investigation as I probably should!

tseemann commented 4 years ago

@cizydorczyk if you have an issue post it to https://github.com/tseemann/snippy/issues/new

mearls999 commented 4 years ago

Thanks a million, Conrad!

peflanag commented 4 years ago

Maybe this was answered and I just missed it but what is the scale Gubbins trees are outputted in? They aren't SNPs per site? Is there an option to as it to put the branch lengths in SNPs per site?

katholt commented 2 years ago

I wanted to know the answer to this question too! As there was no answer here, I did some digging and found that if you take the Gubbins output file which contains the alignment of polymorphic sites ('.filtered_polymorphic_sites.fasta') and infer your own tree with RAxML, the tree that you get (which has branch lengths in subs per site in the input alignment, because that's what RAxML does) is essentially identical to the tree output by Gubbins but the Gubbins tree is scaled by a factor of the alignment length (ie the numebr of polymorphic sites). So, I am pretty sure the Gubbins output tree is scaled by alignment length to give you branch lengths in substitutions, rather than substitutions per site. (Note that the branch lengths are in substitutions, which is not exactly the same as a discrete SNP count as the branch length estimation accounts for the substitution model, reversals, missing data etc.) This means that when you analyse Gubbins trees with e.g. BactDating to get dated trees and estimate mutation rates, the mutation rate is expressed as substitutions per genome per unit time, not subs per site per unit time, so is comparable across different input trees made from alignments of different lengths.

It would be great if Nick or someone could confirm though...

nickjcroucher commented 2 years ago

Thanks for the question, sorry for the slow response to the original issue - there are three different types of branch length in the output files, so this is certainly something worth clarifying:

Following BactDating, it should be the case that the molecular clock is in substitutions per genome per unit time. I hope that's helpful, I'll add this information to the manual.

nickjcroucher commented 2 years ago

Updated manual in 1777808af1dc9e7fe82402bd3ca45928ca82814b.

wanyuac commented 1 year ago

Thanks for the question, sorry for the slow response to the original issue - there are three different types of branch length in the output files, so this is certainly something worth clarifying:

* **[prefix].final_tree.tre** (and [prefix].tre files from each iteration): these branch lengths are in substitutions per genome, as you suggest @katholt - the relevant code is in [tree_scaling.c](https://github.com/nickjcroucher/gubbins/blob/a2abf6141ff2b249003099d16cf5436ac3e9b619/src/tree_scaling.c#L27), where the branch lengths (in substitutions per site) are multiplied by the number of recombination-filtered SNPs

* **[prefix].joint.tre**: this is an intermediate file, usually deleted, in which the branch lengths are the integer total number of base substitutions (SNPs and point mutations) reconstructed by the joint ancestral reconstruction algorithm

* **RAxML/IQtree/FastTree/RapidNJ intermediate files** - these retain branch lengths in the usual substitutions per site units

Following BactDating, it should be the case that the molecular clock is in substitutions per genome per unit time. I hope that's helpful, I'll add this information to the manual.

Thanks for the explanation. Do substitutions include indels?

nickjcroucher commented 1 year ago

No, the estimates are from the nucleotide substitution model you specified.