Closed peterjc closed 2 years ago
I would use this in the `soil_nematodes` worked example, where one sample currently has over 2 million accepted reads (almost 2000 unique sequences, with a maximum sample read abundance over 350k).
Palmer et al. (2018) (cross reference #319) does something similar, using a (synthetic-control-guided) percentage threshold to remove index switching (aka index bleed), but applied at the OTU level.
OK, merging #376 gets us most of the way, but it defaults to off, and does not interact with the negative controls to automatically raise the threshold.
However, inferring a fractional threshold from a negative control only really makes sense if the negative control contains an identifiable spike-in (e.g. synthetic reads as per our protocols, or an out-group genus). We might then say that if you get 95% spike-in and 5% other, within which the most common unwanted sequence contributed 3%, the threshold should be raised to 3%. Otherwise all reads in the negative control would be unwanted, and might even imply a 100% threshold, which is useless.
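The rule sketched above could look something like this (a minimal sketch, not the implemented behaviour; the function name, the dict-of-counts input, and the spike-in predicate are all my assumptions for illustration):

```python
def infer_fractional_threshold(counts, is_spike_in):
    """Suggest a fractional abundance threshold from a spike-in negative control.

    counts: dict mapping sequence name to read count in the control sample.
    is_spike_in: predicate returning True for spike-in (synthetic) sequences.
    Returns the largest non-spike-in read fraction, i.e. the level a
    fractional threshold must reach to exclude the worst unwanted sequence.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unwanted = [n for name, n in counts.items() if not is_spike_in(name)]
    if not unwanted:
        return 0.0  # control was pure spike-in; no evidence to raise the threshold
    return max(unwanted) / total

# Toy control matching the text: 95% spike-in, 5% other,
# with the most common unwanted sequence at 3% of the reads.
control = {"synthetic_1": 950, "contaminant_a": 30, "contaminant_b": 20}
print(infer_fractional_threshold(control, lambda name: name.startswith("synthetic")))
# 0.03, i.e. raise the fractional threshold to 3%
```

Note the early return when the control is pure spike-in: without any identifiable unwanted reads there is nothing to calibrate against, matching the point that a control with no spike-in gives a useless (up to 100%) threshold.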
So, a dynamic fractional threshold would have to be conditional on negative controls with a spike-in sequence, currently specified via `-y GENUS` or `--synthetic GENUS`.
I'm thinking the auto-threshold might be configurable for the absolute value only (historic behaviour), the fractional value only, or both. When all the samples on a plate give roughly the same yield, the choice makes little difference. However, when that isn't true, adjusting the fractional value seems better... but you need suitable controls.
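To make the "little difference vs. big difference" point concrete, here is a sketch of a per-sample cut-off supporting absolute only, fractional only, or both. How the two would actually combine is my assumption (stricter one wins), not confirmed behaviour:

```python
def sample_threshold(total_reads, absolute=None, fraction=None):
    """Effective minimum read count for one sample.

    Assumes (my guess, not the tool's documented rule) that when both an
    absolute and a fractional threshold are set, the stricter (higher) wins.
    """
    candidates = []
    if absolute is not None:
        candidates.append(absolute)
    if fraction is not None:
        candidates.append(fraction * total_reads)
    return max(candidates, default=0)

# High-yield sample: the fractional term dominates (cf. -a 100 -f 0.001)
print(sample_threshold(2_000_000, absolute=100, fraction=0.001))  # 2000.0
# Typical sample: the absolute floor dominates
print(sample_threshold(50_000, absolute=100, fraction=0.001))     # 100
```

With uniform yields across a plate the two cut-offs nearly coincide, so the mode barely matters; with a 2-million-read outlier sample the fractional term raises the bar twenty-fold.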
Probably need to explore this on our existing in-house data and/or any public dataset with spike-in negative controls.
See #405: changing from `-a 100` (current default) to `-a 100 -f 0.001` makes almost no difference for the `recycled_water` example.
Looking at this paper with synthetic ITS1-like sequences for use in a mock community:

Palmer et al. (2018) Non-biological synthetic spike-in controls and the AMPtk software pipeline improve mycobiome data. https://doi.org/10.7717/peerj.4925
Their initial over-estimate of Illumina tag switching was between 0.233% and 0.264% (multiple potential sequence sources with environmental samples). Using default Illumina de-multiplexing (allowing one mismatch in the index sequence), the tag-switching rate measured with the synthetic mock community was 0.057% (where they could be sure where the reads came from).
So a default of 0.1% seems reasonable to exclude most tag-switching.
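For scale, a quick calculation of what `-f 0.001` (0.1%) means in absolute reads at different sample yields, using the 2-million-read sample mentioned above and two smaller, hypothetical yields:

```python
# Absolute read cut-off implied by a 0.1% fractional threshold (-f 0.001)
for total_reads in (2_000_000, 100_000, 10_000):
    cutoff = 0.001 * total_reads
    print(f"{total_reads:>9,} accepted reads -> drop sequences below {cutoff:,.0f} reads")
```

So on the high-yield sample this is far stricter than the `-a 100` absolute default, while on a 10k-read sample it is far more lenient, which is why the two settings complement each other.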
I suspect risks of cross contamination vary dramatically with protocol and laboratory, far more so than the standardised Illumina tagging.
I am currently thinking that the command line should accept two (possibly overlapping) lists of control samples:
This would be useful for high- or variable-coverage datasets, and would better match the published analyses behind some of the existing worked examples.
e.g. Quoting the Muri et al. (2020) paper used as our `drained_ponds` example:

That seems like a sensible default (based on our own tree nursery data), although maybe leave the default as zero (off).
This would probably supplement the current absolute setting with a new fractional option. Note `-f` is currently in use for a minor metadata setting.