lczech / gappa

A toolkit for analyzing and visualizing phylogenetic (placement) data
GNU General Public License v3.0

Comprehending LCA choice (gappa edit accumulate) #14

Closed. FWittmers closed 3 years ago

FWittmers commented 3 years ago

Hey Lucas,

I ran into something that I do not fully understand regarding the prediction of the LCA with gappa edit accumulate. I have placements on a phylogenetic tree through EPA-ng for the sequence. Here is the extract from the jplace results:

    {"p": [
      [765, -38520.0071789560, 0.1429501017, 0.0000001074, 0.0337700903],
      [767, -38520.0071827704, 0.1429495565, 0.0000001246, 0.0337700933],
      [764, -38520.0071859950, 0.1429490955, 0.0000000996, 0.0337700946],
      [763, -38520.0080773995, 0.1428217268, 0.0155884715, 0.0337776025],
      [762, -38520.0080777812, 0.1428216723, 0.0051511020, 0.0337744692],
      [766, -38520.0081063836, 0.1428175873, 0.0051348303, 0.0337704137],
      [768, -38520.0136246843, 0.1420316475, 0.0000500000, 0.0337666203]
      ],
    "n": ["seq19365"]

and drawn (poorly) on a subset of the tree (also from the same jplace file, exported as a Newick tree), this would come down to the 7 positions I marked with dots, if I understand this correctly. Node labels are black, edges are light blue.

Screen Shot 2021-01-08 at 00 29 12

If I analyze this jplace file with gappa edit accumulate (--threshold 0.9), it predicts:

    "p": [
      [ 768, 0, 1, 0, 0.0337713 ]
      ],
    "n": [ "seq19365" ]

as the LCA placements, which is highlighted in green, but shouldn't the red edge reflect the LCA edge? Maybe I am missing something here...
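For reference, a quick sanity check on the numbers in the extract above. This is only a sketch: it assumes the third column of each placement row is the likelihood weight ratio (LWR), which the extract itself does not state because the "fields" entry is not shown. It sorts the LWRs and counts how many placements are needed before their running sum reaches the 0.9 threshold.

```python
from itertools import accumulate

# (edge_num, like_weight_ratio) pairs copied from the extract above.
# Assumption: column 1 is the edge number and column 3 is the LWR.
placements = [
    (765, 0.1429501017),
    (767, 0.1429495565),
    (764, 0.1429490955),
    (763, 0.1428217268),
    (762, 0.1428216723),
    (766, 0.1428175873),
    (768, 0.1420316475),
]

THRESHOLD = 0.9  # same value as passed via --threshold

# Sort LWRs from largest to smallest and build the running total.
lwrs = sorted((lwr for _, lwr in placements), reverse=True)
running = list(accumulate(lwrs))

# Number of placements needed before the running total reaches the threshold.
needed = next(i + 1 for i, total in enumerate(running) if total >= THRESHOLD)

print("placements needed:", needed)
print("mass of the six largest:", round(running[-2], 4))
print("mass of all seven:", round(running[-1], 4))
```

With these values, the six largest placements together stay below 0.9, so all seven edges (including 768) have to be covered before the threshold is met.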

I also imported these results into my workflow, but wanted to stick to the original files here so I can make sure the issue is not caused by some other tool.

Best, Fabian

lczech commented 3 years ago

Hey Fabian,

that might be a bug... Would you mind sharing a full jplace file here? Ideally, with just that one query in it, but also with the tree and the other non-query json entries. That would greatly help in debugging!

Thanks a lot Lucas

FWittmers commented 3 years ago

Hey Lucas,

I attached the EPA-ng JPLACE and the gappa JPLACE. Note that the LWR differs slightly, because I did remove like 3000 other queries and reran it for this, but the outcome and placements are still identical. The tree should be in both files. Hope it helps. Also attached the log files of both commands, in case that is of any help. Please let me know if there is anything else I can do!

epa_result.jplace.txt LCA_accumulated.jplace.txt epa_info.log LCA.log

Best, Fabian

lczech commented 3 years ago

Hey Fabian,

it seems that the tree that you so meticulously drew there does not match the tree in the jplace file. The respective excerpt from the tree in your files is:

    ((CATbAT00:0.005156{762},UncB1640:0.015593{763})0.0000000000:0{764},(CDADBA02:0{765},UncBa448:0.00514{766})0.0000000000:0{767})0.9140000000:0.015373{768}

which looks like this

image

(visualizing your jplace file with https://itol.embl.de/). When looking at the accumulated file, I get this:

image

which seems correct to me. I also repeated the visualizations with gappa examine heat-tree to make sure, and get the same results.

So it seems the issue (luckily for me) is not a bug in gappa, but rather that the tree you used for placement differs from the one that you used to draw the conclusions that led you to open this issue. Would you mind checking this and letting me know if that is the case? If not, we will have to investigate further.
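To make the reasoning reproducible without a tree viewer, here is a minimal Python sketch. It is a simplified illustration and not gappa's actual implementation. It reads the `{edge_num}` annotations from the Newick excerpt above, sums the LWRs in the subtree below each edge, and walks down from edge 768 for as long as a child subtree still holds at least the threshold mass. The helper names (`child_edges`, `subtree_mass`, `deepest_edge_reaching`) are made up for this example, the LWR values are the ones from the extract in the opening post, and the excerpt is assumed to be rooted outside of edge 768.

```python
import re

# Newick excerpt from the jplace tree, with edge numbers in curly braces.
NEWICK = (
    "((CATbAT00:0.005156{762},UncB1640:0.015593{763})0.0000000000:0{764},"
    "(CDADBA02:0{765},UncBa448:0.00514{766})0.0000000000:0{767})"
    "0.9140000000:0.015373{768}"
)

# LWR per edge, taken from the extract in the opening post
# (assumption: the third column of each placement row is the LWR).
LWR = {
    765: 0.1429501017, 767: 0.1429495565, 764: 0.1429490955,
    763: 0.1428217268, 762: 0.1428216723, 766: 0.1428175873,
    768: 0.1420316475,
}


def child_edges(newick):
    """Map each edge number to the edge numbers directly below it."""
    children, stack, pending = {}, [], []
    for tok in re.findall(r"[()]|\{\d+\}", newick):
        if tok == "(":
            stack.append([])          # start collecting children of an inner node
        elif tok == ")":
            pending = stack.pop()     # children of the node whose edge tag follows
        else:
            edge = int(tok[1:-1])
            children[edge] = pending  # leaves get an empty list
            pending = []
            if stack:
                stack[-1].append(edge)
    return children


def subtree_mass(edge, children, lwr):
    """LWR on this edge plus all LWR on edges below it."""
    return lwr.get(edge, 0.0) + sum(
        subtree_mass(c, children, lwr) for c in children[edge]
    )


def deepest_edge_reaching(start, children, lwr, threshold):
    """Descend from `start` while some child subtree still holds >= threshold.

    Does not check that `start` itself reaches the threshold.
    """
    edge = start
    while True:
        heavy = [c for c in children[edge]
                 if subtree_mass(c, children, lwr) >= threshold]
        if not heavy:
            return edge
        edge = heavy[0]


children = child_edges(NEWICK)
result = deepest_edge_reaching(768, children, LWR, threshold=0.9)
print("edge holding >= 0.9 of the mass:", result)
```

Under these assumptions the two inner clades (edges 764 and 767) each hold only about 0.43 of the mass, so neither reaches 0.9 on its own, and the sketch stays on edge 768, which matches the accumulated jplace file.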

Cheers and so long Lucas

FWittmers commented 3 years ago

Hey Lucas,

Thanks a ton for looking into this so soon, I greatly appreciate it. My visualisation was done in ggtree using some lab-internal pipeline from the stone ages. The info you gave here helped me find multiple bugs in the internal scripts used to calculate the LCA, and I agree that gappa is giving the correct results. ... At least I now have a reason to overthrow that outdated pipeline and use gappa in my new one 👍. Really enjoying the possibilities of EPA-ng and gappa in combination btw!

Best, Fabian

lczech commented 3 years ago

Thanks for the feedback, and glad to hear!

So, I have so many cool features planned for gappa, but I moved on to a different field of research, and only sporadically add new stuff to gappa. But if you have cool ideas for tools that are of general usage, please let me know!

Also, if you are okay with working in C++, I would (shamelessly) recommend having a look at our library genesis, which can also do things like LCA and so on. It probably has a steeper learning curve and less documentation than ggtree, but you gain speed and features. Also, I'm happy to give hints and help.

Lucas