rvalieris / parallel-fastq-dump

parallel fastq-dump wrapper
MIT License

wrong file size of output fastq files #28

Closed · millerh1 closed 4 years ago

millerh1 commented 4 years ago

Hello! It looks like parallel-fastq-dump creates fastq files which are not the same size as the fastq files created by fasterq-dump.

For example:

prefetch SRR5683211 -O output
parallel-fastq-dump -t 80 -O output/parallel-fastq --tmpdir tmp/ -s output/SRR5683211.sra --split-files
fasterq-dump -e 80 -O output/fasterq -S output/SRR5683211.sra

Fasterq size:

du -h output/fasterq/SRR5683211.sra_1.fastq
7.8G    output/fasterq/SRR5683211.sra_1.fastq

Parallel fastq size:

du -h output/parallel-fastq/SRR5683211_1.fastq
7.6G    output/parallel-fastq/SRR5683211_1.fastq

I've noticed this before and sometimes the data loss is very large -- especially when I don't use prefetch first. Do you know what may be causing this behavior?

rvalieris commented 4 years ago

hello,

I just downloaded the same SRR and did some tests.

first thing to notice is the number of lines/records in each file: they both have 138967296 lines and 34741824 fastq records, so no records are missing.
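The record count follows directly from the line count, since a fastq record is always 4 lines (`@name`, sequence, `+name`, quality). A minimal sketch of that arithmetic, using the line count reported above:

```shell
# A fastq record is 4 lines: @name, sequence, +name, quality.
# So record count = line count / 4.
lines=138967296
records=$((lines / 4))
echo "$records"   # 34741824
```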

so why do they have different sizes? compare the first line from each file:

$ head -n1 a1/SRR5683211_1.fastq a2/SRR5683211.sra_1.fastq
==> a1/SRR5683211_1.fastq <==
@SRR5683211.1 1 length=76

==> a2/SRR5683211.sra_1.fastq <==
@SRR5683211.sra.1 1 length=76

the fastq dumped by fasterq has an extra .sra on the record name! every record in the file gains an extra 8 bytes because of this: ".sra" is 4 bytes, and the record name appears on both the sequence header and the quality header, so 4 × 2 = 8 bytes per record.

so if we take the 7.6G file (8085142404 bytes) and add 8 bytes per record we get: 8085142404 + (8 × 34741824) = 8363076996 bytes ≈ 7.8G
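That accounting can be checked with a couple of lines of shell arithmetic (sizes taken from the numbers above; the GiB conversion is just for comparison with `du -h`):

```shell
# Size of the parallel-fastq-dump file, in bytes, and the record count.
small=8085142404
records=34741824

# Add the 8 extra header bytes per record that fasterq-dump writes.
big=$((small + 8 * records))
echo "$big"   # 8363076996

# Convert to GiB, matching what du -h reports for the fasterq file.
awk -v b="$big" 'BEGIN { printf "%.1fG\n", b / 2^30 }'   # 7.8G
```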

so the file created by fasterq-dump is only slightly bigger because of the record names; the actual sequence data is the same.

millerh1 commented 4 years ago

Oh -- very good catch! Thank you for putting my mind at ease about using parallel-fastq-dump!