Open ctb opened 4 years ago
This continues to be a problem when running 6-frame translated read searches against protein databases for classification. We know the % classified will be incorrect, but I'm not sure we have enough information to produce a "correct" % classification, since we haven't properly evaluated the number of k-mers coming from incorrect ORFs that do map to reference databases.
We could use orpheum to find the correct ORF prior to running gather, and compare the results with 6-frame translation.
This is especially a problem for downstream use of gather --> tax, e.g. krona output, where the 'fraction' reported is of the 6-frame translated sketch:
fraction superkingdom phylum class order family genus species
0.0337919997763739 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Dinophyceae_XX Kryptoperidinium
0.030124298839007847 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Dinophyceae_XX Alexandrium
0.028228912369629093 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Dinophyceae_XX Scrippsiella
0.018354810756415273 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Dinophyceae_XX Karenia
0.015011290011988844 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Suessiales Symbiodinium
In charcoal we are trying out 6-frame translations to do decontamination: https://github.com/dib-lab/charcoal/pull/120. The gather output that is reported is pretty lousy because it doesn't adjust for the (large) number of false negatives that come from comparing a 6-frame translation signature to a database constructed with --input-is-protein.
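To make the false-negative effect concrete, here is a rough toy sketch in plain Python. It does not use sourmash or its hashing; every function name is hypothetical, and k-mers containing stop codons are simply kept, which is a simplification. It translates one coding sequence in all six frames and compares the fraction of k-mers matching a protein reference when the frames are pooled versus when only the best frame is considered.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(dna):
    """Reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; stops become '*'."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield the three forward and three reverse-complement translations."""
    for strand in (dna, revcomp(dna)):
        for offset in range(3):
            yield translate(strand[offset:])


def kmers(protein, k):
    """Set of all length-k substrings of a protein string."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def frame_fractions(dna, reference_kmers, k):
    """Return (pooled, best): matched fraction over all six frames pooled,
    and matched fraction for the single best-matching frame."""
    per_frame = [kmers(frame, k) for frame in six_frames(dna)]
    pooled_kmers = set().union(*per_frame)
    pooled = len(pooled_kmers & reference_kmers) / len(pooled_kmers)
    best = max(len(f & reference_kmers) / len(f) for f in per_frame if f)
    return pooled, best


# Toy input: a short stop-free coding sequence. The "database" holds only
# the k-mers of its true (frame 0) translation.
K = 5
dna = "ATGAAAACCGCGTATATTGCGAAACAGCGTCAGATTAGCTTTGTGAAAAGCCATTTTAGCCGTCAG"
protein = translate(dna)
reference = kmers(protein, K)

pooled, best = frame_fractions(dna, reference, K)
print(f"pooled 6-frame fraction: {pooled:.3f}")
print(f"best single-frame fraction: {best:.3f}")
```

Here the correct frame matches the reference completely, while the pooled fraction is much lower because the five incorrect frames mostly contribute k-mers that the protein database cannot contain. Real data is messier: some incorrect-frame k-mers do match reference databases, so a simple rescaling of the pooled fraction would not be a reliable correction.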
I wonder if there's anything we can do about this? Seems ...tricky. I'm not even sure we currently track enough information to flag when this is happening!
relevant to #999