mikolmogorov / Ragout

Chromosome-level scaffolding using multiple references

Interpretation of branch length in the inferred tree #78

Closed amizeranschi closed 1 year ago

amizeranschi commented 1 year ago

Hi! I am using Ragout to extend a draft assembly of a plant genome, using a chromosome-level assembly of a related species as the reference. I have aligned the 2 genomes with Cactus and am using the resulting HAL alignment as input for Ragout.

I was wondering how the branch lengths in the inferred phylogenetic tree should be interpreted. In Cactus, these represent the number of substitutions per site, but in Ragout I see much larger values than would be expected under that interpretation.

I ran some tests based on the evolverMammals example from Cactus. I have replaced the input tree, maintaining the structure but removing the original branch lengths, which leads Cactus to use the default value of 1 for each branch:

((simHuman_chr6,(simMouse_chr6,simRat_chr6)),(simCow_chr6,simDog_chr6));

From the resulting HAL assembly, Ragout inferred the following tree:

((simHuman_chr6:52.0,(simCow_chr6:47.75,simDog_chr6:64.25):28.0):21.75,(simMouse_chr6:36.5,simRat_chr6:31.5):21.75);

For comparison, I converted the HAL assembly to MAF and used this with the phyloFit command from Phast, to infer the tree. This was the result:

((simHuman_chr6:0.125059,(simMouse_chr6:0.0795162,simRat_chr6:0.0858094):0.244229):0.0296015,(simCow_chr6:0.172295,simDog_chr6:0.148977):0.0296015);

Setting aside the difference in the order of magnitude of the branch lengths, the relative lengths of the branches within each of the 2 trees also differ. This can be seen by visualizing the 2 trees side by side at www.phylo.io

Interestingly, the tree inferred by Phast is very similar to the original tree from the evolverMammals example:

((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974):0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303):0.032898);

In that case, why is Ragout's tree so different? And could this introduce bias into Ragout's assembly?
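A rough way to quantify the comparison above is to divide each terminal branch length by the corresponding length in the original evolverMammals tree. The sketch below does this with a regular expression rather than a full Newick parser. `leaf_lengths` and `ratios_to_reference` are hypothetical helper names, the code assumes every leaf name starts with `sim` (as in the trees quoted in this thread), and internal branches are ignored.

```python
import re

def leaf_lengths(newick):
    """Map each leaf name to its terminal branch length.

    Assumption: leaf names start with 'sim', so labelled internal
    nodes such as 'mr' are skipped.
    """
    pairs = re.findall(r"(sim\w+):([\d.eE+-]+)", newick)
    return {name: float(length) for name, length in pairs}

def ratios_to_reference(tree, reference):
    """Per-leaf ratio of terminal branch lengths: tree / reference."""
    ref = leaf_lengths(reference)
    cur = leaf_lengths(tree)
    return {leaf: cur[leaf] / ref[leaf] for leaf in ref}

original = ("((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,"
            "simRat_chr6:0.091589)mr:0.271974):0.020593,"
            "(simCow_chr6:0.18908,simDog_chr6:0.16303):0.032898);")
phylofit = ("((simHuman_chr6:0.125059,(simMouse_chr6:0.0795162,"
            "simRat_chr6:0.0858094):0.244229):0.0296015,"
            "(simCow_chr6:0.172295,simDog_chr6:0.148977):0.0296015);")
ragout = ("((simHuman_chr6:52.0,(simCow_chr6:47.75,simDog_chr6:64.25):28.0)"
          ":21.75,(simMouse_chr6:36.5,simRat_chr6:31.5):21.75);")

for label, tree in (("phyloFit", phylofit), ("Ragout", ragout)):
    ratios = ratios_to_reference(tree, original)
    spread = max(ratios.values()) / min(ratios.values())
    print(label, {k: round(v, 2) for k, v in ratios.items()},
          "max/min ratio:", round(spread, 2))
```

For these trees, the phyloFit ratios stay in a narrow band just below 1, while the Ragout ratios vary by more than 1.5-fold between leaves, so the Ragout tree differs from the original by more than a constant scale factor.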

For a final test, I also ran the evolverMammals example with a star-shaped guide tree, with all the genomes on the same level:

(simHuman_chr6,simMouse_chr6,simRat_chr6,simCow_chr6,simDog_chr6);

In this scenario, the tree inferred by Ragout differed even more from what one would expect:

((simCow_chr6:51.75,(simHuman_chr6:60.8333333333,simMouse_chr6:3.16666666667):11.75):2.625,(simDog_chr6:64.5,simRat_chr6:1e-06):2.625);
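Because a different root placement alone can make two Newick strings look different, it can help to compare the trees as unrooted trees, by the leaf bipartitions (splits) that their internal branches induce. The following is a minimal sketch with a small hand-written Newick parser. `parse_newick`, `leaves` and `splits` are hypothetical helper names, not part of Ragout, Cactus or Phast.

```python
import re

PUNCT = set("(),:;")

def parse_newick(text):
    """Parse a Newick string into nested (label, length, children) tuples."""
    tokens = re.findall(r"[(),:;]|[^(),:;\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        label, length = "", None
        if tokens[pos] not in PUNCT:      # optional node label
            label = tokens[pos]
            pos += 1
        if tokens[pos] == ":":            # optional branch length
            length = float(tokens[pos + 1])
            pos += 2
        return (label, length, children)

    return node()

def leaves(n):
    """Set of leaf names below node n."""
    label, _, children = n
    if not children:
        return frozenset([label])
    return frozenset().union(*(leaves(c) for c in children))

def splits(tree):
    """Non-trivial bipartitions of an unrooted tree, independent of rooting."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)                 # fixed leaf used to pick one side
    found = set()

    def walk(n):
        side = leaves(n)
        if ref in side:                   # always store the side without ref
            side = all_leaves - side
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in n[2]:
            walk(child)

    walk(tree)
    return found

ragout = parse_newick("((simHuman_chr6:52.0,(simCow_chr6:47.75,simDog_chr6:64.25):28.0):21.75,(simMouse_chr6:36.5,simRat_chr6:31.5):21.75);")
phylofit = parse_newick("((simHuman_chr6:0.125059,(simMouse_chr6:0.0795162,simRat_chr6:0.0858094):0.244229):0.0296015,(simCow_chr6:0.172295,simDog_chr6:0.148977):0.0296015);")
ragout_star = parse_newick("((simCow_chr6:51.75,(simHuman_chr6:60.8333333333,simMouse_chr6:3.16666666667):11.75):2.625,(simDog_chr6:64.5,simRat_chr6:1e-06):2.625);")

print("Ragout vs phyloFit, same unrooted topology:", splits(ragout) == splits(phylofit))
print("Ragout (star guide) vs phyloFit, same unrooted topology:", splits(ragout_star) == splits(phylofit))
```

For the trees quoted here, the first Ragout tree and the phyloFit tree yield the same two splits (Cow and Dog versus the rest, Mouse and Rat versus the rest), so they differ in rooting and branch lengths rather than in unrooted topology. The tree inferred with the star-shaped guide tree groups Human with Mouse and Dog with Rat instead.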

mikolmogorov commented 1 year ago

Hello,

Ragout trees are inferred from breakpoint distances (e.g. inversion or translocation breakpoints) rather than substitution distances. In general, this measure is correlated with substitution rates, but not always: for example, there are known cases of accelerated accumulation of structural variants along some branches of a phylogenetic tree. Also note that the tree reconstructed by Ragout is unrooted. Since Ragout is mostly concerned with the structural differences between the genomes, a tree reconstructed from breakpoints may be the better choice, even if it does not fully agree with a substitution-based tree.

Best, Mikhail
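To illustrate what a breakpoint distance measures, here is a toy sketch using the textbook definition: each genome is a linear order of signed synteny blocks, and a breakpoint is an adjacency between block ends that is present in one genome but absent in the other. This is an assumption-laden illustration, not Ragout's actual implementation, and `adjacencies` and `breakpoints` are hypothetical names.

```python
def adjacencies(blocks):
    """Adjacencies between consecutive block ends on one linear chromosome.

    Each block b has a tail (b, "t") and a head (b, "h"). A block with a
    positive sign is read tail -> head, a negative one head -> tail.
    """
    ends = []
    for b in blocks:
        tail, head = (abs(b), "t"), (abs(b), "h")
        ends.append((tail, head) if b > 0 else (head, tail))
    # Pair the right end of each block with the left end of the next one.
    return {frozenset((ends[i][1], ends[i + 1][0]))
            for i in range(len(ends) - 1)}

def breakpoints(a, b):
    """Number of adjacencies of genome a that are missing from genome b."""
    return len(adjacencies(a) - adjacencies(b))

reference = [1, 2, 3, 4, 5]
inverted = [1, -3, -2, 4, 5]   # blocks 2..3 inverted relative to the reference

print(breakpoints(inverted, reference))   # 2: one breakpoint at each end of the inversion
print(breakpoints(reference, reference))  # 0: identical block orders
```

Point substitutions inside a block leave this count unchanged, which is why a tree built from such distances need not track substitutions per site.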

amizeranschi commented 1 year ago

Great, thank you for the response. Everything is clear now.