pierrebarbera / epa-ng

Massively parallel phylogenetic placement of genetic sequences
GNU Affero General Public License v3.0

Intuition for a common taxonomic-assignment procedure #31

Closed adriaaula closed 5 years ago

adriaaula commented 5 years ago

Hello Pierre,

I have a conceptual problem. For a set of query sequences of metagenomic origin, I want to know the taxonomy. I have a reference tree with ~600 sequences. Should I:

I think the latter is a feasible option, but I have one major concern. If I have 400 query sequences, wouldn't this be a case of overfitting? Maybe some environmental sequences form a new cluster by themselves. So, what would be the most common approach here?

Additionally, I also have an amplicon dataset coming from the same metagenomic sequences, covering a smaller region. Should I place these queries on the whole tree (references + metagenomic query seqs), or could this again be considered overfitting, so that I should work only with the references (which are in fact the source of the taxonomy)?

EPA-ng is really fast and useful, thank you for working on it :)

pierrebarbera commented 5 years ago

Hi Adria, great to hear from you!

So Phylogenetic Placement was made specifically to avoid your first case, namely building a single tree from a mix of long, high-quality reference sequences and short, lower-quality query sequences from NGS. One fear when mixing both sets is that the low-quality sequences could behave like rogue taxa, as they don't carry much phylogenetic signal due to their length (though I would say 600 bp is definitely on the long end of short reads). The other motivator was the practical impossibility of inferring a tree from the tens of thousands to millions of query sequences in a metabarcoding sample, though as you say, with 400 query sequences that's not so much of an issue.

Still, I would recommend you use placement, and as you have so few sequences you probably want to use the thorough, slightly more accurate version enabled via --no-heur. I'm not entirely sure what you mean by overfitting; maybe you can clarify that. If you mean that just placing them against the reference is too "coarse", then yes, there will always be phylogenetic structure among the queries that landed on a given branch. One approach would be to take just those sequences per branch, and build a tree out of those. However, that wouldn't give you anything in terms of taxonomic assignment, as that is tied to the tree as well.
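
To make the "sequences per branch" idea concrete, here is a minimal Python sketch that groups query names by the branch of their best placement. It assumes the standard jplace layout (a "fields" list plus "placements" entries with "p" rows and names under "n" or "nm"); the function name and the tiny example data are made up for illustration and are not part of EPA-ng or gappa.

```python
from collections import defaultdict


def group_queries_by_best_edge(jplace):
    """Map edge_num -> names of the queries whose best placement is on that edge."""
    fields = jplace["fields"]
    edge_idx = fields.index("edge_num")
    lwr_idx = fields.index("like_weight_ratio")

    groups = defaultdict(list)
    for pquery in jplace["placements"]:
        # Take the placement with the highest likelihood weight ratio as "best".
        best = max(pquery["p"], key=lambda row: row[lwr_idx])
        # Names are stored under "n" (plain names) or "nm" (name, multiplicity pairs).
        names = list(pquery.get("n", []))
        names += [name for name, _mult in pquery.get("nm", [])]
        groups[best[edge_idx]].extend(names)
    return dict(groups)


# Hand-made example in jplace layout (hypothetical values, not real results).
example = {
    "version": 3,
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[2, -100.0, 0.9, 0.01, 0.02], [5, -103.0, 0.1, 0.01, 0.02]],
         "n": ["q1"]},
        {"p": [[5, -90.0, 0.6, 0.01, 0.02], [2, -91.0, 0.4, 0.01, 0.02]],
         "n": ["q2"]},
        {"p": [[2, -80.0, 1.0, 0.01, 0.02]], "nm": [["q3", 4]]},
    ],
}

groups = group_queries_by_best_edge(example)
print(groups)
```

For a real run, the dictionary would come from loading the placement result file with json.load instead of the hand-made example. Note that this only uses the single best placement per query and ignores how the likelihood weight is spread over other branches, which may or may not be acceptable for your data.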

As for the other placement, again I wouldn't mix queries into your reference tree, that needlessly complicates things... unless you specifically want to show interactions between the two datasets, in detail. If it's just for taxonomic assignment, then I wouldn't bother.

Lastly, I'm sure you already had a look since you mentioned gappa, but the gappa examine assign command is what you want to use. I also just now pushed an update to it which you should use; I think a bug crept into one of the more recent commits that caused the reference tree to be incorrectly labelled sometimes. Also, it now includes an option to automatically resolve missing taxonomic annotations in the reference tree (in case you do want to go with a mixed approach).
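
As a rough sketch of how the two steps from this thread (thorough placement, then taxonomic assignment) might be chained, the Python snippet below only assembles the two command lines and prints them; it does not run anything. All file names and the model string are hypothetical placeholders, and the option names other than --no-heur are written as I understand them from the tools' help output, so please check them against epa-ng --help and gappa examine assign --help for your installed versions.

```python
import shlex


def build_commands(ref_msa, tree, query_msa, model, taxon_file, out_dir):
    """Return (placement, assignment) command lines as argument lists."""
    place = [
        "epa-ng",
        "--ref-msa", ref_msa,      # reference alignment
        "--tree", tree,            # reference tree
        "--query", query_msa,      # query sequences aligned to the reference
        "--model", model,          # substitution model (assumed spelling)
        "--outdir", out_dir,
        "--no-heur",               # thorough placement, as suggested above
    ]
    assign = [
        "gappa", "examine", "assign",
        # Assumes the placement result is written as epa_result.jplace in out_dir.
        "--jplace-path", f"{out_dir}/epa_result.jplace",
        "--taxon-file", taxon_file,  # taxonomy for the reference tips
        "--out-dir", out_dir,
    ]
    return place, assign


# Hypothetical inputs, for illustration only.
place_cmd, assign_cmd = build_commands(
    ref_msa="reference.fasta",
    tree="reference.newick",
    query_msa="queries_aligned.fasta",
    model="GTR+G",
    taxon_file="taxonomy.tsv",
    out_dir="placement_out",
)

print(shlex.join(place_cmd))
print(shlex.join(assign_cmd))
```

Building the commands as lists keeps them safe to hand to subprocess.run later without any shell quoting issues; shlex.join (Python 3.8+) is only used here to show them in copy-pasteable form.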

I also have a genesis app to perform taxonomic assignment in a very similar way to gappa examine assign, which we used in a recent paper: https://github.com/Pbdas/long-reads/blob/master/src/partial-tree-taxassign.cpp (paper: https://www.biorxiv.org/content/10.1101/627828v1)

Don't hesitate to ask more questions, and Happy Placement :)

Pierre

pierrebarbera commented 5 years ago

Also I guess the more appropriate channel for this kind of question would be our placement-googlegroup: https://groups.google.com/forum/#!forum/phylogenetic-placement

If you don't mind, I will copy this thread over to it and close this issue here.

Edit: https://groups.google.com/forum/#!topic/phylogenetic-placement/JJbTtGx-i34

adriaaula commented 5 years ago

Thank you, Pierre, for the fast response. It's much clearer now (I also had a talk with a mutual friend today)!