peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License

Minimum abundance as percentage (fraction) #374

Closed peterjc closed 2 years ago

peterjc commented 3 years ago

This would be useful for high or variable coverage datasets, and would better match the published analyses behind some of the existing worked examples.

For example, quoting the Muri et al. (2020) paper used as our drained_ponds example:

A low-frequency noise threshold of 0.001 (0.1%) was applied across the dataset to reduce the probability of false positives arising from cross-contamination or tag-jumping (De Barba et al. 2014; Hänfling et al. 2016). Based on the level of contamination found in sampling/filtration blanks and PCR negatives, a second arbitrary threshold was applied and all records occurring with less than 50 reads assigned were removed.

A 0.1% threshold seems like a sensible default (based on our own tree nursery data), although maybe the default should be left as zero (off).

Would probably supplement the current absolute setting:

  -a ABUNDANCE, --abundance ABUNDANCE
                        Minimum abundance applied to unique marker sequences
                        in each sample (i.e. each FASTQ pair), default 100.
                        May be increased based on negative controls.

with something like:

  -f FRACTION, --abundance-fraction FRACTION
                        Minimum abundance fraction, low frequency noise threshold
                        applied to unique marker sequences in each sample. Default
                        0.001 (i.e. 0.1%).

or:

  -p PERCENTAGE, --percent-abundance PERCENTAGE
                        Minimum abundance percentage, low frequency noise threshold
                        applied to unique marker sequences in each sample. Should be
                        under 10%, default 0.1 meaning 0.1%.

Note -f is currently in use for a minor metadata setting.
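
As a rough illustration of how the two settings might combine, here is a minimal Python sketch. It assumes the effective per-sample threshold is the larger of the absolute value and the given fraction of that sample's total read count, which is not confirmed above. The function name and the example counts are hypothetical and not taken from the tool.

```python
def filter_by_abundance(counts, min_abundance=100, min_fraction=0.0):
    """Keep unique sequences passing both the absolute and fractional thresholds.

    counts maps each unique sequence (or a label for it) to its read count
    within one sample. Assumption: the fraction is of the sample's total reads.
    """
    total = sum(counts.values())
    # Whichever threshold is stricter for this sample wins.
    threshold = max(min_abundance, min_fraction * total)
    return {seq: n for seq, n in counts.items() if n >= threshold}


# Made-up counts for one high-coverage sample.
sample = {"seq_A": 350_000, "seq_B": 1_500, "seq_C": 400, "seq_D": 120, "seq_E": 30}
print(sorted(filter_by_abundance(sample)))  # absolute threshold only
print(sorted(filter_by_abundance(sample, min_fraction=0.001)))  # both thresholds
```

With these made-up counts the absolute threshold alone keeps four sequences, while adding a 0.1% fraction also drops `seq_D`, because 0.1% of this sample's total is roughly 352 reads.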

peterjc commented 3 years ago

I would use this in the soil_nematodes worked example, where one sample currently has over 2 million accepted reads (almost 2000 unique sequences with a maximum sample read abundance over 350k).

peterjc commented 3 years ago

Palmer et al. (2018) (cross reference #319) do something similar, using a (synthetic control guided) percentage threshold to remove index switching (aka index bleed), but at the OTU level.

peterjc commented 2 years ago

OK, merging #376 gets us most of the way - but defaults to off, and does not interact with the negative controls to automatically raise the threshold.

However, inferring a fractional threshold from a negative control only really makes sense if the negative control contains an identifiable spike-in (e.g. synthetic reads as per our protocols, or an out-group genus). We might then say that if you get 95% spike-in and 5% other, within which the most common unwanted sequence contributed 3% of the reads, the threshold should be raised to 3%. Without a spike-in, all the reads in the negative control would be unwanted, and might even give a 100% threshold, which is useless.

So, a dynamic fractional threshold would have to be conditional on negative controls with a spike-in sequence, currently specified via -y GENUS or --synthetic GENUS.
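
The spike-in reasoning above can be sketched in Python as follows. This is only an illustration, assuming we already know which sequences in the negative control are the spike-in; the function name, the `spike_in` argument and the counts are hypothetical.

```python
def fraction_from_control(counts, spike_in):
    """Suggest a fractional threshold from one negative control.

    counts: dict mapping unique sequence -> read count in the control.
    spike_in: set of sequences recognised as the expected spike-in.

    Returns the largest share of the control's reads taken by any single
    non-spike-in sequence, or None if the control has no spike-in reads
    (then every read is unwanted and no sensible fraction can be inferred).
    """
    if not any(counts.get(seq, 0) > 0 for seq in spike_in):
        return None
    total = sum(counts.values())  # non-zero here, as spike-in reads exist
    unwanted = [n for seq, n in counts.items() if seq not in spike_in]
    return max(unwanted, default=0) / total


# Made-up control: 95% spike-in, 5% other, worst unwanted sequence at 3%.
control = {"synthetic_1": 9_500, "unwanted_1": 300, "unwanted_2": 150, "unwanted_3": 50}
print(fraction_from_control(control, {"synthetic_1"}))
print(fraction_from_control(control, set()))  # no spike-in known
```

Returning `None` when there are no spike-in reads is one way to avoid the useless 100% threshold; the caller would then leave the fractional setting unchanged.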

I'm thinking the auto-threshold might be configurable for the absolute value only (historic behaviour), fractional value only, or both. When all the samples on a plate are giving roughly the same yield, that makes little difference. However, when that isn't true, adjusting the fractional value seems better... but you need suitable controls.
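
To illustrate why the choice matters when yields differ, here is a hedged Python sketch with made-up numbers. It assumes a control-derived absolute threshold is the largest unwanted read count seen in the control, and a control-derived fractional threshold is that count as a share of the control's total reads. The function and the mode names are hypothetical.

```python
def effective_threshold(sample_total, control_unwanted_max, control_total, mode,
                        min_abundance=100, min_fraction=0.0):
    """Per-sample read threshold after raising the settings from a control.

    mode is "absolute", "fraction" or "both" (hypothetical names).
    """
    if mode not in ("absolute", "fraction", "both"):
        raise ValueError(f"Unknown mode: {mode}")
    abundance = min_abundance
    fraction = min_fraction
    if mode in ("absolute", "both"):
        # Historic behaviour: raise the absolute read count threshold.
        abundance = max(abundance, control_unwanted_max)
    if mode in ("fraction", "both"):
        # Raise the fractional threshold instead (or as well).
        fraction = max(fraction, control_unwanted_max / control_total)
    return max(abundance, fraction * sample_total)


# Control with 10,000 reads, worst unwanted sequence at 300 reads (3%).
for total in (10_000, 1_000_000):
    for mode in ("absolute", "fraction", "both"):
        print(total, mode, effective_threshold(total, 300, 10_000, mode))
```

With these numbers all three modes give a threshold of about 300 reads for a sample with the same yield as the control. For the million-read sample the absolute mode stays at 300 reads, while the other two rise to about 30,000 reads.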

Probably need to explore this on our existing in-house data and/or any public dataset with spike-in negative controls.

peterjc commented 2 years ago

See #405: changing from -a 100 (current default) to -a 100 -f 0.001 makes almost no difference for the recycled_water example.

peterjc commented 2 years ago

Looking at this paper with synthetic ITS1-like sequences for use in a mock community:

Palmer et al. (2018) Non-biological synthetic spike-in controls and the AMPtk software pipeline improve mycobiome data https://doi.org/10.7717/peerj.4925

Their initial over-estimate for Illumina tag-switching was between 0.233% and 0.264% (with environmental samples there are multiple potential sources for any sequence). Using default Illumina de-multiplexing (allowing one mismatch in the index sequence), they found tag-switching measured with the synthetic mock community was 0.057% (here we can be sure where these reads came from).

So a default of 0.1% seems reasonable to exclude most tag-switching.

I suspect the risks of cross-contamination vary dramatically with protocol and laboratory, far more so than the standardised Illumina tagging.

peterjc commented 2 years ago

I am currently thinking that the command line should accept two (possibly overlapping) lists of control samples: