Open wasade opened 2 years ago
thanks @wasade I think this is basically a duplicate of #103 ... orientation autodetection is done on the first 100 seqs I believe, so we see problems like this esp. if reads are in mixed orientations or there are a few noisy or non-target reads (even just a few bad seeds).
would it make sense to test both orientations and retain the one with higher confidence?
this is what is done on the first 100 reads to autodetect. It has been discussed for some time whether this should instead be done on all reads, i.e., test both orientations and pick the one that looks most reasonable. As your example shows, usually only one orientation looks reasonable and the wrong orientation is usually classified at domain level or unclassified.
It would increase runtime, but I would personally be in favor of this (or of adding a `both` orientation option, so the current autodetect on a subsample remains as an option), as it would fix a few old thorns in our side regarding autodetection and mixed orientations. Would you be interested in adding this?
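For anyone picking this up, the per-read idea discussed above (classify each read in both orientations and keep the more confident call) can be sketched roughly as follows. This is only an illustration, not the plugin's code: `classify_both`, the `Classifier` callable type, and `toy_classify` are hypothetical names, and the real classifier would take the place of the toy one.

```python
from typing import Callable, Tuple

# Hypothetical classifier interface (an assumption for this sketch):
# takes a DNA sequence, returns (taxonomy label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_both(seq: str, classify: Classifier) -> Tuple[str, float, str]:
    """Classify a read in both orientations and keep the more confident call.

    Returns (label, confidence, orientation), where orientation is
    "same" or "reverse-complement". Ties go to "same".
    """
    fwd_label, fwd_conf = classify(seq)
    rev_label, rev_conf = classify(reverse_complement(seq))
    if rev_conf > fwd_conf:
        return rev_label, rev_conf, "reverse-complement"
    return fwd_label, fwd_conf, "same"


def toy_classify(seq: str) -> Tuple[str, float]:
    """Stand-in classifier for demonstration only (not a real model)."""
    if seq.startswith("ACGG"):
        return "k__Archaea", 0.95
    return "Unassigned", 0.30


# A read stored in the "wrong" orientation is still recovered.
read = reverse_complement("ACGGTTAA")
result = classify_both(read, toy_classify)
print(result)
```

Each read is classified twice, so runtime roughly doubles. Note also that the example in this issue shows the wrong orientation can itself receive a high-confidence call, so "keep the higher confidence" is a heuristic and may still need a tie-breaking or sanity check.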
Thanks, @nbokulich. Unfortunately, I will be going on paternity leave soon and my bandwidth to understand unfamiliar codebases is quite limited. What I recommend considering is using `both` by default, erring on the side of caution, as the consequence of auto detection being wrong could be bad.
congrats! that makes two of us — maybe @BenKaehler would like to take up the task?
I agree that `both` should be the new default. But we could keep the original `autodetect` as an option for those who liked that behavior.
Thanks!! And congrats to you as well!
I think that sounds like a great plan
@nbokulich, @wasade - anyone interested in picking this up? I think this is also a good first issue, so I'll tag it with that for Hacktoberfest folks or other new developers.
yeah I agree, hacktoberfest
Bug Description
When classifying a 23M feature set and, separately, a 20M subset, it was observed that the number of reported Archaea differed by two orders of magnitude (4k in 23M and 400k in 20M).
The behavior was observed for both Greengenes and SILVA with QIIME 2 2022.2.
On Slack, @BenKaehler kindly suggested testing the `--p-read-orientation` parameter with an individual sequence. In the below example, a single 90nt sequence from the 2017 EMP paper, originally classified to the order level within Archaea, is tested with both `same` and `reverse-complement` settings. In the reverse complement case, we observe the sequence being classified as `k__Bacteria` with high confidence.
Steps to reproduce the behavior
Expected behavior
The result of classifying an Archaea with high confidence as a Bacteria was surprising. However, the user is not presented with any indication that this may be the case. Given how fast classification occurs, and the risk that incorrect results present to a user, would it make sense to test both orientations and retain the one with higher confidence?
Computation Environment
References