naobservatory / mgs-pipeline

MIT License
4 stars 2 forks source link

Read counting isn't correct #18

Closed jeffkaufman closed 9 months ago

jeffkaufman commented 9 months ago

In collect-n-reads-single.sh I'd written:

aws s3 cp "$raw" - | gunzip | grep -c ^@ > "$n_reads"

This can overestimate of the number of reads in a fastq file if (a) the file was sequenced on a machine that can emit a quality score of @ (Q31) and (b) the first character in the read's quality score is @.

I'm going to replace this with a bit of python that uses FastqGeneralIterator and recalc.

My expectation is that most files are unchanged (most of the bases we have were sequenced on big Illuminas that only use four quality values and don't include @) and for the files that are changed there shouldn't be too much of a difference? But we'll see.

jeffkaufman commented 9 months ago

Here's what I see for changes after fixing this:

Study Bioproject Before After Delta Ratio
Brinch-2020 PRJEB13832 120366010 117520141 -2845869 -2.36%
Brinch-2020 PRJEB34633 254297490 248641816 -5655674 -2.22%
Hendriksen-2019 PRJEB13833 956946171 784382857 -172563314 -18.03%
Hjelmso-2019 PRJEB30546 293134595 287585696 -5548899 -1.89%
Ng-2019 PRJNA438174 1765543240 1629559012 -135984228 -7.70%
Petersen-2015 PRJEB12466 1982488807 1509457199 -473031608 -23.86%
Spurbeck-2023 PRJNA924011 7369178 7369165 -13 0.00%

Overall this is very slightly good news: since our denominator is lower it means the relative abundance of human viruses is higher than we'd thought. But only slightly, since most studies aren't affected and the ones that are affected aren't ones we've been drawing heavily on.