sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

how should we adjust output when using 6-frame-translated signatures? #1087

Open ctb opened 4 years ago

ctb commented 4 years ago

In charcoal we are trying out 6-frame translations for decontamination: https://github.com/dib-lab/charcoal/pull/120. The gather output that gets reported is pretty lousy because it doesn't adjust for the (large) number of false negatives that come from comparing a 6-frame translation signature to a database constructed with --input-is-protein.

I wonder if there's anything we can do about this? Seems ...tricky. I'm not even sure we currently track enough information to flag when this is happening!

relevant to #999
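
To make the false-negative effect concrete, here is a toy sketch in plain Python. It does not use sourmash and does not reproduce its hashing, scaling, or protein k-mer handling; the helper names (`six_frames`, `matched_fraction`) and the example protein are made up for illustration. It translates a coding sequence in all six frames and measures what fraction of the resulting k-mers are found in the true protein, which is roughly what a query-side fraction would be measuring.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; stop codons become '*'."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield the three forward and three reverse-complement translations."""
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            yield translate(strand[offset:])


def kmers(protein, k):
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def matched_fraction(dna, protein, k):
    """Fraction of distinct 6-frame k-mers that occur in the true protein."""
    query = set()
    for frame in six_frames(dna):
        query |= kmers(frame, k)
    return len(query & kmers(protein, k)) / len(query)


# Hypothetical example: back-translate a made-up protein using the first
# codon listed for each amino acid, so frame 0 is the only "real" frame.
BACK_TABLE = {}
for codon, aa in CODON_TABLE.items():
    BACK_TABLE.setdefault(aa, codon)

protein = "MKTAYIAKQRQISFVKSHFSRQ"
dna = "".join(BACK_TABLE[aa] for aa in protein)

print(f"{matched_fraction(dna, protein, k=5):.3f} of 6-frame k-mers match")
```

Even though every k-mer of the true protein is present in the query, the reported fraction stays well short of 1 because the other five frames contribute k-mers that cannot match. Whether a real correction can be derived from this depends on how often wrong-frame k-mers do hit the database, which this toy does not measure.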

bluegenes commented 2 years ago

This continues to be a problem when running 6-frame translated read searches against protein databases for classification. We know the % classified will be incorrect, but I'm not sure we have enough information to produce a "correct" % classified, since we haven't properly evaluated how many k-mers from incorrect ORFs nonetheless map to reference databases.

We could:

bluegenes commented 2 years ago

...especially a problem for downstream use of gather --> tax, e.g. krona output, where the 'fraction' reported is of the 6-frame translated sketch...

fraction        superkingdom    phylum  class   order   family  genus   species
0.0337919997763739      Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Dinophyceae_XX  Kryptoperidinium
0.030124298839007847    Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Dinophyceae_XX  Alexandrium
0.028228912369629093    Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Dinophyceae_XX  Scrippsiella
0.018354810756415273    Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Dinophyceae_XX  Karenia
0.015011290011988844    Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Suessiales      Symbiodinium
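
One possible stopgap for the krona case, sketched below in plain Python, is to report each fraction relative to the classified total rather than relative to the whole 6-frame sketch. This is an assumption about what might be useful downstream, not an existing sourmash option, and it only uses the five rows pasted above (the full output may contain more). The function name `rescale_to_classified` is hypothetical.

```python
# Rows copied from the krona-style output above (whitespace-separated).
KRONA_ROWS = """\
fraction superkingdom phylum class order family genus species
0.0337919997763739 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Dinophyceae_XX Kryptoperidinium
0.030124298839007847 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Dinophyceae_XX Alexandrium
0.028228912369629093 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Dinophyceae_XX Scrippsiella
0.018354810756415273 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Dinophyceae_XX Karenia
0.015011290011988844 Eukaryota Alveolata Dinophyta Dinophyceae Dinophyceae_X Suessiales Symbiodinium
"""


def parse_rows(text):
    """Parse whitespace-separated rows into dicts keyed by the header."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, body = lines[0], lines[1:]
    return [dict(zip(header, fields)) for fields in body]


def rescale_to_classified(rows):
    """Return (classified_total, rows with fraction divided by that total).

    Assumption: unmatched k-mers are dominated by wrong-frame translations,
    so the share among classified k-mers is the more interpretable number.
    """
    total = sum(float(row["fraction"]) for row in rows)
    if total == 0:
        raise ValueError("no classified fraction to rescale")
    rescaled = [dict(row, fraction=float(row["fraction"]) / total) for row in rows]
    return total, rescaled


rows = parse_rows(KRONA_ROWS)
total, rescaled = rescale_to_classified(rows)
print(f"classified fraction of 6-frame sketch: {total:.4f}")
for row in rescaled:
    print(f"{row['fraction']:.3f}\t{row['species']}")
```

This hides the unclassified portion entirely, so the original total should be reported alongside the rescaled values. It also does nothing about wrong-frame k-mers that happen to match a reference.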