lczech / gappa

A toolkit for analyzing and visualizing phylogenetic (placement) data
GNU General Public License v3.0
56 stars 7 forks source link

Subcommand: graft vs. guppy tog #7

Closed justinshaffer closed 4 years ago

justinshaffer commented 5 years ago

Hello,

I am comparing two trees resulting from one set of placements from SEPP, one tree generated using gappa graft (fully resolved) and the other using guppy tog.

I observed unexpected differences between the trees based on estimates of tip-to-tip distances (e.g., a value of 0.04) and the Robinson-Fourd metric (e.g., a value of 2.0).

Do you know why the subcommand graft (fully resolved) would result in alternative topologies vs. guppy tog?

Thanks in advance.

cc: @antgonza

lczech commented 5 years ago

Hi Justin,

my guess is that the exact order in which the queries are attached along the branch might differ, for example if multiple queries have the same proximal/distal position on the branch.

Would you mind providing a minimal example, preferably with a small tree and a few queries that shows the behavior you describe?

Lucas

justinshaffer commented 5 years ago

Hi Lucas,

Sure thing. I've provided an example below. Apologies for the delay.

I've attached an archive including the following files:

Here is the python code that I used to convert the placements file for input to gappa:

import re

placements_just1000 = json.loads(open('just_1000_fragements.json').read())
regex = re.compile(r"\[(\d+)\]")
placements_just1000['tree'] = regex.sub(r'{\1}', placements_just1000['tree'])
placements_just1000['tree'] = placements_just1000['tree'].replace('UQrYOlnDN000988k__Bacteria', 'UQrYOlnDN000988k__Bacteria{9999999}')

encoded_just1000 = json.dumps(placements_just1000)
g = open('just_1000_fragements_gappa_fix.json', 'w')
g.write(encoded_just1000)
g.close()

Here is the bash code that I used to generate the trees from guppy and gappa:

guppy tog just_1000_fragments.json
gappa analyze graft --jplace-path just_1000_fragments_gappa_fix.json --fully-resolve --out-dir .

Here is the python code that I used to import the trees, compare tip-to-tip distances between the two trees, and calculate Robinson-Fould distance between the two trees:

import skbio

guppy_just1000 = skbio.TreeNode.read('just_1000_fragments.tog.tre')
gappa_just1000 = skbio.TreeNode.read('just_1000_fragments_gappa_fix.newick')
guppy_just1000.compare_rfd(gappa_just1000)

The last two lines of code above indicate an RFD of 4.0.

I then further explored if the variation stemmed from only the placements (i.e., insertions), so performed similar analyses of the trees pruned of the references:

guppy_just1000_tips_to_keep = {n.name for n in guppy_just1000.tips() if set(n.name).intersection({'A', 'C', 'T', 'G'})}
gappa_just1000_tips_to_keep = {n.name for n in gappa_just1000.tips() if set(n.name).intersection({'A', 'C', 'T', 'G'})}

gappa_just1000_tips_to_keep == guppy_just1000_tips_to_keep

gappa_just1000_sheared = gappa_just1000.shear(gappa_just1000_tips_to_keep)
guppy_just1000_sheared = guppy_just1000.shear(guppy_just1000_tips_to_keep)

guppy_just1000_sheared.compare_tip_distances(gappa_just1000_sheared)
guppy_just1000_sheared.compare_rfd(gappa_just1000_sheared)

The last two lines of code above indicate a tip distance of 0.03 and an RFD of 0.0.

Upon closer inspection of branch length values for the same nodes between the two trees, it appears that they have been re-computed differently by guppy vs. gappa (e.g., 0.33343449999999997 vs. 0.33343500000000004).

The non-zero RFD is slightly more concerning, especially given that we only observe this when the references, which should remain stable during SEPP, are included.

Just as a note, when I perform the same analyses on similar trees generated with a much larger set of placements (n > 1M), I observe between the guppy and gappa trees an RFD = 1120174.0, and with the references pruned (i.e., including only the placements as tips, as above) an RFD = 1029242.0. So, while the RFD for the trees with only 1000 placements is non-zero only when references and insertions are both included, for the larger set of placements we observe non-zero distances also for when references are pruned.

Can you please provide insight to why the branch lengths and topologies differ between trees produced by guppy vs. gappa?

Thanks in advance.

cc: @antgonza @wasade

guppy_gappa_issue.zip

lczech commented 5 years ago

Dear Justin,

thanks for the details. In fact, you are reporting two issues here:

  1. gappa not being able to read SEPP-generated jpace files, and
  2. gappa and guppy producing different trees from these files.

As for 1: This is fixed now in gappa v0.4.0. The issue was that SEPP apparently outputs an outdated, deprecated version of the jplace standard (see here and here for details), which I never encountered before, and for which gappa hence did not have support. I'm in contact with the authors of SEPP now, and try to convince them to update their output to a more recent version of the jplace standard. Also, I now added support for these old versions, so that it now works with both jplace files that you included. Furthermore, the reason why guppy did not work with your "fixed" file is that you forgot to change the "version" field in the file - guppy saw that it was the old version 1, but your fix basically converted it to version 3, which seems to have confused guppy. Also, next time, please report such issues on their own ;-)

As for 2: With the files that you sent, I cannot really investigate what happens. Your tree contains >400k branches, and in your examples, you only placed 1k sequences on them. Most of them are scattered across all branches. In fact, from visual inspection, I only found one pair of query sequences that were placed on the same branch of the reference tree:

TACGGAGGGTGCAAACGTTGCTCGGAATCATTGGGCGTAAAGCGCGCGTAGGCGGCCATGTAAGTTGGATGTGAAAGCCCAGGGCTCAACCCTGGAAGTGCACCCAAAACTGCGTGGCTTGAGTACCGGAGGGGCTCAGAGAATTCCCGG
TACGGAGGGTGCAAACGTTGCTCGGAATCATTGGGCGTAAAGCGCACGTAGGCGGCCAAGCAAGTTGGATGTGAAAGCCCAGGGCTCAACCCTGGAAGTGCACCCAAAACTGCATGGCTTGAGTCCCGGAGGGGCCCAGAGAATTCCCGG

Hence, this is the only example that I found where gappa and guppy could have differed, but in this case don't. So, not much for me to investigate here :-(

This also explains the RF distances. If you remove the 400k reference taxa, the remaining topology of the 1k query sequences does not differ between guppy and gappa, because the 1k query sequences are placed on different branches anyway, so there is no branch where guppy and gappa might differ in how they resolve these when building the graft tree. As a side note, when working with such large trees, it might make sense to use relative RF distance instead of absolute, in order to better judge how difference the trees really are.

Furthermore, I could not fully observe the branch length differences that you describe. Your exemplary numbers (0.33343449999999997 and 0.33343500000000004) do not occur in the data that you sent. The only difference that I could find is the number of significant digits that each of the programs outputs. While guppy seems to use 8 significant digits (e.g., 0.00765207), gappa only prints 6 digits (0.007652). I could increase this, but usually, this does not matter much. So, my guess is that the difference that you describe is simply due to the software that you used to investigate the files (skbio?). When reading floating point numbers, any software has to convert it to its inner data type for floats, and when printing this, you get these results with lots of trailing 9s or 0s, which however do not reflect the accuracy in the input file. Better directly have a look at the input file instead - this is where you will find the numbers that actually matter.

So, my guess again is that the difference is simply due to slight numerical differences. If you want to be sure what is going on, you however have to provide a dataset where I can actually observe the differences you describe. That is, a dataset where several query sequences are placed on the same branch, and resolved differently by guppy and gappa. If you find a suitable subset of your query sequences and post it here, I can have a further look.

Lastly, I think that what you are trying to achieve here is a bit risky anyways. If I understand correctly, you want to build a fully resolved tree for 1M sequences from just their placement on the 400k reference tree, right? I would highly advice against this. The resolution and robustness will be very very poor, as the placements are only resolved with respect to the reference tree, but not with respect to each other. That is, the tree will not be of good quality. Then again, a 1M sequences tree is really hard to come by, work with, visualize, compare, etc anyway. But I do not fully know what you are planning, so maybe it's fine.

So long Lucas

lczech commented 4 years ago

Hi Justin, as there seems to be no update on your issue, I'm going to close it now. If you need, feel free to re-open any time! Lucas