Read orientation auto detection failure

wasade commented 2 years ago

Bug Description When classifying a 23M feature set, and separately a 20M subset, it was observed that the number of reported Archaea differed by two orders of magnitude (4k in 23M and 400k in 20M).

The behavior was observed for both Greengenes and SILVA with QIIME 2 2022.2.

On Slack, @BenKaehler kindly suggested testing the --p-read-orientation parameter with an individual sequence.

In the below example, a single 90nt sequence from the 2017 EMP paper, originally classified to the order level within Archaea, is tested with both same and reverse-complement settings. In the reverse complement case, we observe the sequence being classified as k__Bacteria with high confidence.

Steps to reproduce the behavior

$ md5sum ../../gg-13-8-99-515-806-nb-classifier.qza
9e82e8969303b3a86ac941ceafeeac86  ../../gg-13-8-99-515-806-nb-classifier.qza
$ cat example.fna
>example
CACCGGCGGCTCGAGTGGTGGCCGCTATTATTGGGCTTAAAGGGTCCGTAGCCGGGCCAGTTAGTCCCTTGGGAAATCTTACGGCTTAAC
$ qiime tools import --input-path example.fna --output-path example.fna.qza --type FeatureData[Sequence]
Imported example.fna as DNASequencesDirectoryFormat to example.fna.qza
$ qiime feature-classifier classify-sklearn \
    --i-reads example.fna.qza \
    --i-classifier ../../gg-13-8-99-515-806-nb-classifier.qza \
    --o-classification example.classified.same.qza \
    --p-read-orientation same
Saved FeatureData[Taxonomy] to: example.classified.same.qza
$ qiime feature-classifier classify-sklearn \
    --i-reads example.fna.qza \
    --i-classifier ../../gg-13-8-99-515-806-nb-classifier.qza \
    --o-classification example.classified.rc.qza \
    --p-read-orientation reverse-complement
Saved FeatureData[Taxonomy] to: example.classified.rc.qza
$ qiime tools export --input-path example.classified.same.qza --output-path example.classified.same
Exported example.classified.same.qza as TSVTaxonomyDirectoryFormat to directory example.classified.same
$ qiime tools export --input-path example.classified.rc.qza --output-path example.classified.rc
Exported example.classified.rc.qza as TSVTaxonomyDirectoryFormat to directory example.classified.rc
$ cat example.classified.same/taxonomy.tsv 
Feature ID  Taxon   Confidence
example k__Archaea; p__Euryarchaeota; c__Methanomicrobia; o__Methanocellales; f__; g__; s__ 0.999986467281595
$ cat example.classified.rc
example.classified.rc/     example.classified.rc.qza  
$ cat example.classified.rc/taxonomy.tsv 
Feature ID  Taxon   Confidence
example k__Bacteria 0.9709375640478621

Expected behavior The result of classifying an Archaea with high confidence as a Bacteria was surprising. However, the user is not presented with an indication this may be the case. Given the how fast classification occurs, and the risk incorrect results presents to a user, would it make sense to test both orientations and retain the one with higher confidence?

Computation Environment

$ cat /etc/redhat-release 
CentOS Linux release 7.9.2009 (Core)
$ uname -a
Linux barnacle2 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:32:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ qiime info
System versions
Python version: 3.8.12
QIIME 2 release: 2022.2
QIIME 2 version: 2022.2.0
q2cli version: 2022.2.0

Installed plugins
alignment: 2022.2.0
composition: 2022.2.0
cutadapt: 2022.2.0
dada2: 2022.2.0
deblur: 2022.2.0
demux: 2022.2.0
diversity: 2022.2.0
diversity-lib: 2022.2.0
emperor: 2022.2.0
feature-classifier: 2022.2.0
feature-table: 2022.2.0
fragment-insertion: 2022.2.0
gneiss: 2022.2.0
longitudinal: 2022.2.0
metadata: 2022.2.0
phylogeny: 2022.2.0
quality-control: 2022.2.0
quality-filter: 2022.2.0
sample-classifier: 2022.2.0
taxa: 2022.2.0
types: 2022.2.0
vsearch: 2022.2.0

Application config directory
/home/mcdonadt/miniconda3/envs/qiime2-2022.2/var/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org

References

Slack #developers channel, April 14th

nbokulich commented 2 years ago

thanks @wasade I think this is basically a duplicate of #103 ... orientation autodetection is done on the first 100 seqs I believe, so we see problems like this esp. if reads are in mixed orientations or there are a few noisy or non-target reads (even just a few bad seeds).

would it make sense to test both orientations and retain the one with higher confidence?

this is what is done on the first 100 reads to autodetect. It has been discussed for some time whether this should instead be done on all reads, i.e., test both orientations and pick the one that looks most reasonable. As your example shows, usually only one orientation looks reasonable and the wrong orientation is usually classified at domain level or unclassified.

It would increase runtime but I would personally be in favor of this (or adding a both orientation option, so the current autodetect on a subsample remains as an option), as it would fix a few old thorns in our side regarding autodetection and mixed orientations. Would you be interested in adding this?

wasade commented 2 years ago

Thanks, @nbokulich. Unfortunately, I will be going on paternity leave soon and my bandwidth to understand unfamiliar codebases is quite limited. What I recommend considering is using both by default, erring on the side of caution, as the consequence of auto detection being wrong could be bad.

nbokulich commented 2 years ago

congrats! that makes two of us — maybe @BenKaehler would like to take up the task?

I agree that both should be the new default. But we could keep the original autodetect as an option for those who liked that behavior.

wasade commented 2 years ago

Thanks!! And congrats to you as well!

I think that sounds like a great plan

gregcaporaso commented 1 year ago

@nbokulich, @wasade - anyone interested in picking this up? I think this is also a good first issue, so I'll tag it with that for Hacktoberfest folks or other new developers.

nbokulich commented 1 year ago

yeah I agree, hacktoberfest

qiime2 / q2-feature-classifier

Read orientation auto detection failure #177