lmrodriguezr / nonpareil

Estimate metagenomic coverage and sequence diversity
http://enve-omics.ce.gatech.edu/nonpareil/
Other

Systematic bias at low coverage (under 20%) #45

Open tylerbarnum opened 3 years ago

tylerbarnum commented 3 years ago

(For others who come across this: this is an issue with an edge use case of Nonpareil; I’m otherwise very happy with the program and trust it for higher-coverage samples.)

I designed an experiment to see how the output of Nonpareil changes when a FASTQ file is repeatedly halved in size. Above a redundancy value of 20%, the subsampled FASTQ files follow the Nonpareil curve of the larger FASTQ file. Below 20%, however, the data show a systematic bias towards low redundancy (the plot below shows an example of the data within the affected range). The bias affects estimates of diversity and of how much additional sequencing effort is needed. I suspect that the issue lies, using the language of the original paper, in the assumptions behind how the total number of reads affects the probability of observing matches between reads. As the total number of reads decreases, matches between reads become less and less likely; is the binomial distribution still appropriate in such a context?

image
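
For anyone who wants to reproduce the halving experiment described above, here is a minimal Python sketch. It rests on assumptions that are not stated in this issue: that each halving is a random subsample without replacement of the previous file (so the subsets are nested), and that the FASTQ uses plain four-line records. The helper names `read_fastq` and `halving_series` are hypothetical and are not part of Nonpareil.

```python
import random


def read_fastq(lines):
    """Group FASTQ lines into 4-line records (assumes unwrapped records)."""
    lines = [ln.rstrip("\n") for ln in lines]
    # Drop trailing empty lines, which some files carry at the end.
    while lines and lines[-1] == "":
        lines.pop()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def halving_series(records, min_reads=1, seed=0):
    """Return [full set, half, quarter, ...], each a random half of the previous.

    Assumption: sampling is without replacement and nested, so every
    subset is contained in the one before it. Read order is preserved.
    """
    rng = random.Random(seed)  # fixed seed so the series is reproducible
    series = [list(records)]
    while len(series[-1]) // 2 >= min_reads:
        current = series[-1]
        keep = sorted(rng.sample(range(len(current)), len(current) // 2))
        series.append([current[i] for i in keep])
    return series


def write_fastq(records, path):
    """Write records back out as a 4-line-per-read FASTQ file."""
    with open(path, "w") as handle:
        for record in records:
            handle.write("\n".join(record) + "\n")
```

Each subset in the series could then be written out with `write_fastq` and run through Nonpareil separately, so the resulting curves can be compared with the curve of the full file. Nesting the subsets is a design choice made here for illustration; drawing every subset independently from the full file would be an equally reasonable variant.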