matsen / pplacer

Phylogenetic placement and downstream analysis
http://matsen.fredhutch.org/pplacer/
GNU General Public License v3.0

errors with guppy tog and large trees #366

Open antgonza opened 5 years ago

antgonza commented 5 years ago

Hope this is the right place for this issue.

Anyway, as part of our Qiita archive releases (More Info -> BETA: download Archive files) we are building trees that contain all the fragments processed in the system. However, during the last release we encountered an error with guppy tog, and we can't determine its source.

A bit more background: every single deblur sequence generated in the system is processed via fragment-insertion/SEPP, and the placements are stored in the DB; monthly, we retrieve the placements for all those sequences and generate a full tree. The current issue is the last step in SEPP, where guppy tog simply fails with no visible error. Note that this is the first month we have done this, and we have run it with extra memory (used vs. requested). Also, all of these placements were successfully added to a tree for each individually processed dataset; it only fails when doing the full dump, which IMO points to size.

The file we are having issues with can be found here:

Any help will be greatly appreciated.

cc: @sjanssen2, @smirarab

matsen commented 5 years ago

@antgonza Thank you for the model bug report.

However, I'm afraid that this won't get fixed. Pplacer development stopped years ago and this sounds non-trivial. I suggest you explore the Stamatakis lab's recent work such as http://genesis-lib.org/ and their EPA-NG.

antgonza commented 5 years ago

Thank you for your prompt reply and kind words. Sad news, but I get it.

Out of curiosity, is there a "translator" of commands and formats (not sure if needed) from pplacer/guppy to genesis-lib?

matsen commented 5 years ago

I don't know of one. You should ping them!

matsen commented 5 years ago

I don't really get your use case, but I feel the need to remind you that pplacer is not a typical phylogenetic inference program and was never meant to be one. It sounds to me like you are trying to use it to build a full tree.

Our perspective is that it was meant to place sequences on a tree, with the result being the tree with a collection of placements on it, which we thought of as a bunch of marker points on the tree. One then analyzes that object.

To drive the difference home, if we have identical sequences they will get placed at the same location on the tree. That's reasonable. But if you do guppy tog on them you will get two branches with a non-trivial pendant branch length. That's silly. One should have instead a single branch off the reference tree with the two sequences attaching to that branch with zero branch length.
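For illustration, here is a minimal Python sketch of the "marker points" view described above. It is not how guppy works internally. It assumes a jplace-style dictionary with the usual `fields` and `placements` keys; the helper name `group_by_best_location` and the toy data are hypothetical. It groups sequence names by their best placement location, so identical sequences end up sharing one attachment point rather than getting separate pendant branches.

```python
from collections import defaultdict


def group_by_best_location(jplace):
    """Map (edge_num, distal_length) -> names whose best placement is there.

    `jplace` is assumed to be a dict shaped like a parsed .jplace file.
    """
    fields = jplace["fields"]
    i_edge = fields.index("edge_num")
    i_distal = fields.index("distal_length")
    i_lwr = fields.index("like_weight_ratio")

    groups = defaultdict(list)
    for pquery in jplace["placements"]:
        # Take the placement with the highest likelihood weight ratio.
        best = max(pquery["p"], key=lambda row: row[i_lwr])
        # Names appear under "n" (plain names) or "nm" ([name, multiplicity]).
        names = list(pquery.get("n", []))
        names += [name for name, _mult in pquery.get("nm", [])]
        groups[(best[i_edge], best[i_distal])].extend(names)
    return dict(groups)


# Toy data (invented): seq_a and seq_b are identical, so they receive
# identical placements; seq_c lands elsewhere.
toy = {
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[3, -1234.5, 0.9, 0.01, 0.02], [4, -1236.0, 0.1, 0.03, 0.02]],
         "n": ["seq_a"]},
        {"p": [[3, -1234.5, 0.9, 0.01, 0.02], [4, -1236.0, 0.1, 0.03, 0.02]],
         "n": ["seq_b"]},
        {"p": [[7, -1300.0, 1.0, 0.05, 0.10]], "n": ["seq_c"]},
    ],
}

groups = group_by_best_location(toy)
for location, names in sorted(groups.items()):
    print(location, names)
```

In this toy case seq_a and seq_b share the location on edge 3, which is the "single branch off the reference tree" picture rather than two separate pendant branches.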

To summarize, tog was put in somewhat begrudgingly so people could "see" placements with normal tree viz software. I don't think it should be used in production.