Read counting isn't correct

naobservatory / mgs-pipeline

MIT License

4 stars 2 forks source link

In collect-n-reads-single.sh I'd written:

aws s3 cp "$raw" - | gunzip | grep -c ^@ > "$n_reads"

This can overestimate of the number of reads in a fastq file if (a) the file was sequenced on a machine that can emit a quality score of @ (Q31) and (b) the first character in the read's quality score is @.

I'm going to replace this with a bit of python that uses FastqGeneralIterator and recalc.

My expectation is that most files are unchanged (most of the bases we have were sequenced on big Illuminas that only use four quality values and don't include @) and for the files that are changed there shouldn't be too much of a difference? But we'll see.

Study	Bioproject	Before	After	Delta	Ratio
Brinch-2020	PRJEB13832	120366010	117520141	-2845869	-2.36%
Brinch-2020	PRJEB34633	254297490	248641816	-5655674	-2.22%
Hendriksen-2019	PRJEB13833	956946171	784382857	-172563314	-18.03%
Hjelmso-2019	PRJEB30546	293134595	287585696	-5548899	-1.89%
Ng-2019	PRJNA438174	1765543240	1629559012	-135984228	-7.70%
Petersen-2015	PRJEB12466	1982488807	1509457199	-473031608	-23.86%
Spurbeck-2023	PRJNA924011	7369178	7369165	-13	0.00%

Study

Bioproject

Before

After

Delta

Ratio

Brinch-2020

PRJEB13832

120366010

117520141

-2845869

-2.36%

Brinch-2020

PRJEB34633

254297490

248641816

-5655674

-2.22%

Hendriksen-2019

PRJEB13833

956946171

784382857

-172563314

-18.03%

Hjelmso-2019

PRJEB30546

293134595

287585696

-5548899

-1.89%

Ng-2019

PRJNA438174

1765543240

1629559012

-135984228

-7.70%

Petersen-2015

PRJEB12466

1982488807

1509457199

-473031608

-23.86%

Spurbeck-2023

PRJNA924011

7369178

7369165

-13

0.00%

naobservatory / mgs-pipeline

Read counting isn't correct #18