qiime2 / q2-fragment-insertion

BSD 3-Clause "New" or "Revised" License

Taxonomy #18

Open sjanssen2 opened 6 years ago

sjanssen2 commented 6 years ago

Improvement Description

I thought about the FeatureData[Taxonomy] artifact and Daniel's warnings about the quality of the assigned taxonomic labels, which depends on how well the taxonomic labels are placed in the reference phylogeny. Furthermore, fragment insertion is not unambiguous: it yields a distribution of positions, and I remember Siavash suggesting his program TIPP for taxonomy assignment. Thus, I think we should create the FeatureData[Taxonomy] in a separate function instead of integrating it into the main function ("sepp").

Proposed Behavior

Currently, I am thinking about two alternatives for generating a FeatureData[Taxonomy], plus a third as a possible later addition:

1) classify-paths: the current method, which collects all taxonomic labels along the path from tip to root. The single input would be the Phylogeny[Rooted] artifact.

2) classify-otus: For every inserted fragment, we traverse the tree from tip to root. At every step, we check whether the current sub-tree contains any OTU nodes. If so, we stop; otherwise, we repeat the procedure with the parent node. Once we have found one (or maybe several) OTUs, we look up their assigned taxonomy lineage in the Greengenes/Silva taxonomy table for the corresponding reference tree. In the case of several OTUs, we report the longest common prefix. This would require two inputs: the Phylogeny[Rooted] artifact and the taxonomy table from Greengenes with two columns, OTU-ID and lineage-string. This is the more conservative method and should only produce results on par with current Greengenes-based taxonomy assignment algorithms.

3) classify-tipp: A future development could use Siavash's TIPP to generate taxonomic lineages.
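To make the classify-otus idea concrete, here is a minimal sketch in plain Python. It is not the plugin's implementation: the `Node` class, the function names, the "; " lineage separator, and the "Unassigned" fallback are all assumptions made for illustration. It also assumes that every tip missing from the taxonomy table is an inserted fragment, which the proposal does not specify.

```python
class Node:
    """Hypothetical minimal tree node (stand-in for a real tree library)."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def tips(self):
        """Yield all tips (leaves) below this node, or the node itself if it is a tip."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.tips()


def longest_common_prefix(lineages):
    """Return the ranks shared by all lineages, starting from the highest rank."""
    prefix = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        prefix.append(ranks[0])
    return prefix


def classify_otus(root, otu_taxonomy):
    """Assign each fragment the common lineage prefix of its nearest OTUs.

    otu_taxonomy maps OTU-ID -> lineage-string (ranks separated by "; ").
    Assumption: tips absent from otu_taxonomy are inserted fragments.
    """
    result = {}
    for tip in root.tips():
        if tip.name in otu_taxonomy:
            continue  # reference OTU, not a fragment
        node, otus = tip, []
        # Walk towards the root until the current sub-tree contains an OTU.
        while node is not None and not otus:
            otus = [t.name for t in node.tips() if t.name in otu_taxonomy]
            node = node.parent
        lineages = [otu_taxonomy[o].split("; ") for o in otus]
        prefix = longest_common_prefix(lineages) if lineages else []
        result[tip.name] = "; ".join(prefix) if prefix else "Unassigned"
    return result


# Toy example: frag1 sits next to otu1; frag2 and frag3 share a clade with otu2 and otu3.
root = Node(children=[
    Node(children=[Node("otu1"), Node("frag1")]),
    Node(children=[
        Node(children=[Node("frag2"), Node("frag3")]),
        Node("otu2"),
        Node("otu3"),
    ]),
])
taxonomy = {
    "otu1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "otu2": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "otu3": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
}
result = classify_otus(root, taxonomy)
for fragment, lineage in sorted(result.items()):
    print(fragment, lineage)
```

In this toy tree, frag1 inherits the full lineage of its single neighbouring OTU, while frag2 and frag3 are only resolved down to the phylum shared by otu2 and otu3. That is the conservative behaviour described above.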

Questions

@wasade what are your thoughts?

wasade commented 6 years ago

Sounds interesting

sjanssen2 commented 6 years ago

I think I want to change classify-paths to operate on two inputs: the insertion tree Phylogeny[Rooted] AND the representative sequences FeatureData[Sequence], so that lineages are collected only for the inserted tips. Otherwise, one would need to guess which tips belong to the reference phylogeny and which are inserted fragments. Guessing might work as long as fragment names are nucleotide sequences, but since we allow arbitrary reference phylogenies, we cannot guarantee that no reference tip name consists solely of "acgt" characters.
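A small sketch of why the second input helps, again in plain Python with hypothetical names (`Node`, `classify_paths`, `looks_like_sequence`) rather than the plugin's actual code. The set `rep_seq_ids` stands in for the feature IDs of the FeatureData[Sequence] artifact. Reporting the collected labels from root to tip, joined by "; ", is an assumption.

```python
class Node:
    """Hypothetical minimal tree node; internal node names hold taxonomic labels."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def iter_tips(node):
    """Yield all tips (leaves) below the given node."""
    if not node.children:
        yield node
    for child in node.children:
        yield from iter_tips(child)


def classify_paths(root, fragment_ids):
    """Collect the labels on the path from each inserted tip up to the root.

    Only tips whose name is in fragment_ids are classified, so reference
    tips are never mistaken for fragments.
    """
    lineages = {}
    for tip in iter_tips(root):
        if tip.name not in fragment_ids:
            continue
        labels = []
        node = tip.parent
        while node is not None:
            if node.name:  # unlabeled internal nodes contribute nothing
                labels.append(node.name)
            node = node.parent
        # Labels were gathered tip-to-root; report them root-to-tip.
        lineages[tip.name] = "; ".join(reversed(labels))
    return lineages


def looks_like_sequence(name):
    """The guessing heuristic: a name made up only of a, c, g, t characters."""
    return bool(name) and set(name.lower()) <= set("acgt")


# Toy tree: "ACGTTGCA" is an inserted fragment, "gatc" is a reference tip
# that merely happens to look like a nucleotide sequence.
tree = Node("k__Bacteria", [
    Node("p__Firmicutes", [Node("ACGTTGCA"), Node("otu1")]),
    Node(None, [Node("gatc"), Node("otu2")]),
])
rep_seq_ids = {"ACGTTGCA"}

lineages = classify_paths(tree, rep_seq_ids)
guessed = {tip.name for tip in iter_tips(tree) if looks_like_sequence(tip.name)}
print(lineages)
print(sorted(guessed))
```

With the explicit ID set, only the inserted fragment receives a lineage. The name-based guess also picks up the reference tip "gatc", which is the failure mode described above.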