Closed jeffkaufman closed 9 months ago
Here's what I see for changes after fixing this:
Study | Bioproject | Before | After | Delta | Ratio |
---|---|---|---|---|---|
Brinch-2020 | PRJEB13832 | 120366010 | 117520141 | -2845869 | -2.36% |
Brinch-2020 | PRJEB34633 | 254297490 | 248641816 | -5655674 | -2.22% |
Hendriksen-2019 | PRJEB13833 | 956946171 | 784382857 | -172563314 | -18.03% |
Hjelmso-2019 | PRJEB30546 | 293134595 | 287585696 | -5548899 | -1.89% |
Ng-2019 | PRJNA438174 | 1765543240 | 1629559012 | -135984228 | -7.70% |
Petersen-2015 | PRJEB12466 | 1982488807 | 1509457199 | -473031608 | -23.86% |
Spurbeck-2023 | PRJNA924011 | 7369178 | 7369165 | -13 | 0.00% |
Overall this is very slightly good news: since our denominator is lower it means the relative abundance of human viruses is higher than we'd thought. But only slightly, since most studies aren't affected and the ones that are affected aren't ones we've been drawing heavily on.
In
collect-n-reads-single.sh
I'd written:This can overestimate of the number of reads in a fastq file if (a) the file was sequenced on a machine that can emit a quality score of
@
(Q31) and (b) the first character in the read's quality score is@
.I'm going to replace this with a bit of python that uses
FastqGeneralIterator
and recalc.My expectation is that most files are unchanged (most of the bases we have were sequenced on big Illuminas that only use four quality values and don't include @) and for the files that are changed there shouldn't be too much of a difference? But we'll see.