samtools / htslib

C library for high-throughput sequencing data formats

Truncated gzip (not bgzf) leads to infinite loop when parsing a fastq file #1579

Closed goranvinterhalter closed 1 year ago

goranvinterhalter commented 1 year ago

Hi All,

Parsing a truncated fastq.gz file (plain gzip, not bgzf) leads to an infinite loop.

klib has had a fix for this (in this PR) since 2017.

Could it be that this was overlooked in htslib, or is there another reason why the fix is not in kseq.h?

jkbonfield commented 1 year ago

Thanks for this bug report.

You are correct that we overlooked this revision. I'm now watching Heng's klib so that we can review any future bug fixes as they appear. I'll also check whether there are other existing bug fixes we should have incorporated.

jkbonfield commented 1 year ago

Curiously, the test data in https://github.com/attractivechaos/klib/issues/78 doesn't trigger problems for samtools view or test/test_view. Both correctly identify the broken CRC. I'll still review the changes, but could you please explain what command you're using to hit this bug?

Or are you using htslib/kseq.h directly from your own tool?

goranvinterhalter commented 1 year ago

I'm using it directly from my own tool. I can confirm that the klib version works: for a corrupt ".gz" file, kseq_read returns -3. Note that the fastq file has to be compressed with regular gzip, not bgzf.
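
For anyone reading along, here is a minimal, self-contained sketch of the general failure mode as I understand it. It is not the actual kseq.h code. A gzread()-style reader returns the number of bytes read, 0 at end of file, and -1 on error, so a loop that stops only on 0 can spin forever once the reader starts returning -1. The names mock_source, mock_open, mock_read, count_lines and SKETCH_STREAM_ERROR are all hypothetical and exist only for this illustration; the value -3 mirrors the kseq_read return code mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a gzread()-style source:
 * read returns bytes delivered, 0 at clean EOF, -1 on a stream error. */
typedef struct {
    const char *data;  /* bytes that can be delivered */
    size_t len;        /* total number of bytes in data */
    size_t pos;        /* next byte to deliver */
    int truncated;     /* nonzero: report -1 instead of 0 when data runs out */
} mock_source;

static mock_source mock_open(const char *text, int truncated) {
    mock_source s = { text, strlen(text), 0, truncated };
    return s;
}

static int mock_read(mock_source *s, char *buf, int cap) {
    assert(cap > 0);
    size_t left = s->len - s->pos;
    if (left == 0)
        return s->truncated ? -1 : 0;  /* error vs. clean end of file */
    size_t n = left < (size_t)cap ? left : (size_t)cap;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return (int)n;
}

/* Illustrative error code, chosen to match the -3 reported in this thread. */
enum { SKETCH_STREAM_ERROR = -3 };

/* Count newline characters until the source ends.
 * Returns the count, or SKETCH_STREAM_ERROR if the reader fails. */
static long count_lines(mock_source *s) {
    char buf[8];
    long lines = 0;
    for (;;) {
        int n = mock_read(s, buf, (int)sizeof buf);
        if (n < 0)
            return SKETCH_STREAM_ERROR;  /* without this check the loop never ends */
        if (n == 0)
            return lines;                /* clean end of file */
        for (int i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    }
}
```

With a complete four-line record, count_lines returns 4. With a source that is cut off mid-record and flagged as truncated, it returns SKETCH_STREAM_ERROR instead of looping. If the `n < 0` branch were removed, the second case would keep calling mock_read and never terminate, because -1 is never equal to 0.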