millerh1 closed this issue 4 years ago
Hello,
I just downloaded the same SRR and ran some tests.
The first thing to notice is the number of lines/records in each file: both have 138967296 lines and 34741824 fastq records, so no record is missing.
So why do they have different sizes? Compare the first line from each file:
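The line/record counts above can be reproduced with a short script. This is a minimal sketch; the path `a1/SRR5683211_1.fastq` is taken from the thread, and it assumes a well-formed fastq where every record is exactly 4 lines:

```python
# Count lines and fastq records (4 lines per record) in a dump output.
# Sketch only: builds a tiny synthetic fastq so it runs anywhere;
# substitute the real file, e.g. "a1/SRR5683211_1.fastq".
path = "demo.fastq"
with open(path, "w") as fh:
    fh.write("@r1 1 length=4\nACGT\n+r1 1 length=4\nIIII\n")

with open(path) as fh:
    lines = sum(1 for _ in fh)

records = lines // 4
print(lines, "lines,", records, "records")  # → 4 lines, 1 records
```

Running the same count on both dump outputs should give identical numbers, which is what rules out missing records.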
$ head -n1 a1/SRR5683211_1.fastq a2/SRR5683211.sra_1.fastq
==> a1/SRR5683211_1.fastq <==
@SRR5683211.1 1 length=76
==> a2/SRR5683211.sra_1.fastq <==
@SRR5683211.sra.1 1 length=76
The fastq dumped by fasterq-dump has an extra `.sra` in the record name!
Every record in the file is 8 bytes larger because of this, since the record name appears on both the sequence header and the quality header.
So if we take the 7.6G (8085142404 bytes) and add 8 bytes per record, we get:
8085142404 + (8 * 34741824) = 8363076996 bytes = 7.8G
So the file created by fasterq-dump is only slightly bigger because of the longer record names, but the actual sequence data is identical.
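The arithmetic can be checked directly. The record count and base size come from the thread; the 8-byte overhead is `len(".sra")` counted twice, once for the `@` header and once for the `+` header:

```python
# Verify the size difference explained above.
extra_per_record = len(".sra") * 2   # ".sra" appears on both the '@' and '+' lines
records = 34_741_824                 # fastq records per file (from wc -l / 4)
base = 8_085_142_404                 # size in bytes of the 7.6G file

total = base + extra_per_record * records
print(total)                         # → 8363076996
print(round(total / 1024**3, 1))     # → 7.8 (GiB)
```

The predicted size matches the observed 7.8G file exactly, so the entire difference is accounted for by the longer record names.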
Oh -- very good catch! Thank you for putting my mind at ease about using parallel-fastq-dump!
Hello! It looks like parallel-fastq-dump creates fastq files that are not the same size as the fastq files created by fasterq-dump.
For example:
Fasterq size:
Parallel fastq size:
I've noticed this before, and sometimes the apparent data loss is very large, especially when I don't use prefetch first. Do you know what might be causing this behavior?