marbl / CHM13

The complete sequence of a human genome

Less data in rel7 than rel6? #24

Closed: ktan8 closed this issue 3 years ago

ktan8 commented 3 years ago

I've downloaded both the rel6 and rel7 FASTQ files. The rel6 FASTQ is 352 GB, while the rel7 FASTQ is only 100 GB. Is any data missing from the rel7 FASTQ files?

skoren commented 3 years ago

The number of reads should be identical between rel6 and rel7. Have you checked that? Total bases will be close but not exact, since the base callers change the lengths of some reads. The gzipped file is much smaller because Bonito emits no real base qualities (they are all identical), so the file is essentially a FASTA file and compresses better.
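A minimal sketch of the check described above, counting reads and total bases in a FASTQ file. It assumes standard four-line FASTQ records (no wrapped sequence lines), and the function name `fastq_stats` is made up for illustration; it is not part of any tool mentioned in this thread.

```python
import gzip


def fastq_stats(path):
    """Return (read_count, total_bases) for a plain or gzipped FASTQ file.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    # Pick the opener from the file extension; both yield text lines.
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = 0
    bases = 0
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of every four-line record.
            if line_number % 4 == 1:
                reads += 1
                bases += len(line.rstrip("\r\n"))
    return reads, bases
```

Running this on both releases (for example on hypothetical local copies named `rel6.fastq.gz` and `rel7.fastq.gz`) should give the same read count and similar, but not identical, base totals if nothing is missing.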

ktan8 commented 3 years ago

Thanks Sergey! I did a line count, and you're right that the number of reads is the same. It's still fairly surprising to me, though: even if you subtract half of the rel6 size for the base qualities, that leaves ~170 GB, which is still substantially larger than rel7. Perhaps Bonito calls fewer bases than Guppy, and it's just a feature of the software.

Thanks for your reply!
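On the "subtract half" estimate: in a gzipped FASTQ the quality line may account for well over half of the compressed size, because varied quality values are harder to compress than a four-letter sequence alphabet. The sketch below illustrates this with synthetic data only. The helper names are hypothetical, and uniformly random qualities are an assumption that exaggerates the effect compared with real base-caller output.

```python
import gzip
import random


def gzip_size(text):
    """Return the gzip-compressed size of an ASCII string, in bytes."""
    return len(gzip.compress(text.encode("ascii")))


def make_fastq(n_reads, read_len, constant_quality, seed=0):
    """Build a synthetic four-line FASTQ string.

    Sequences are the same for a given seed regardless of the quality mode,
    so only the quality lines differ between the two variants.
    """
    seq_rng = random.Random(seed)
    qual_rng = random.Random(seed + 1)
    records = []
    for i in range(n_reads):
        seq = "".join(seq_rng.choice("ACGT") for _ in range(read_len))
        if constant_quality:
            # One repeated placeholder value, mimicking identical qualities.
            qual = "!" * read_len
        else:
            # Assumption: uniformly random Phred+33 characters (worst case).
            qual = "".join(chr(33 + qual_rng.randint(2, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records)


if __name__ == "__main__":
    constant = make_fastq(500, 1000, constant_quality=True)
    varied = make_fastq(500, 1000, constant_quality=False)
    print("uncompressed sizes:", len(constant), len(varied))
    print("gzip, constant qualities:", gzip_size(constant))
    print("gzip, varied qualities:  ", gzip_size(varied))
```

Both variants have the same uncompressed size, so any difference in the gzip output comes from the quality lines alone. Real data will differ in magnitude, and differences between base callers can still contribute to the gap.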