pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
786 stars 273 forks source link

FastxFile truncates gzip compressed fastq files #738

Closed davidtsao closed 4 years ago

davidtsao commented 6 years ago

I've noticed that using pysam.FastxFile to iterate over a bgzip'd fastq file only returns a subset of fastq records. The expected result is that all records are iterated over.

Example:

example.fastq.gz contains 1,009,470 reads.

When I iterate over the compressed fastq file using pysam, I only get 26,107 reads instead of the expected 1,009,470 reads.

In [1]: import pysam

In [2]: len([x for x in pysam.FastxFile('example.fastq.gz')])
Out[2]: 26107

I've traced the issue back to bgzip not decompressing all the contents of the fastq file.

For example, I get the expected ~4M lines using gunzip to decompress:

$ bgzip -c -d example.fastq.gz | wc -l 
 4037880

But using bgzip to decompress only returns ~104k lines, which is the same result obtained from pysam (26107 * 4 = 104428):

$ bgzip -c -d example.fastq.gz | wc -l
  104428
AndreasHeger commented 6 years ago

Hi, I guess the issue is that the the gzip'ed file is actually a concatenation of multiple gzip files? This is no problem for gzip, but bgzip and pysam will stop at the end of the first component thinking that the file is complete.

davidtsao commented 6 years ago

The fastq.gz files I'm working with came directly from Illumina's fastq generation process, so unfortunately I have no control over it.

My current workarounds are to either 1) not use pysam for iterating over fastq records or 2) re-compress the fastq.gz file (e.g. gunzip example.fastq.gz && gzip example.fastq)

Perhaps there could be a way to log a warning when this occurs, or a note in the documentation? It led to a hard to trace bug in my code. I'm happy to submit a pull request for a documentation change.

AndreasHeger commented 6 years ago

If you could provide a PR for docs, that would be great, thanks! I agree that this is a serious issue and leave it open - if I have time will look into this to see if the gzip iterator is recoverable from a bogus EOF.

cgjosephlee commented 5 years ago

Any follow-ups of this issue?

jmarshall commented 5 years ago

Without an example file, it is unclear whether this might be early stopping due to extra BGZF trailer blocks (samtools/htslib#45) or early stopping due to multiple GZIP members in a plain GZIP (non-bgzipped) file (samtools/htslib#742) or due to another reason.

The former has been fixed since HTSlib 1.4 so it is probably not that.

The latter has been fixed on HTSlib develop but after the 1.9 release. So if this problem is that, it will be fixed in an upcoming release.

cgjosephlee commented 5 years ago

I guess I'm in the second problem, looking forward to it. Thank you.

zmunro commented 4 years ago

Is a fix on the way for this? I am still experiencing this issue.

jmarshall commented 4 years ago

There is now an HTSlib release containing the fix for the second problem noted in https://github.com/pysam-developers/pysam/issues/738#issuecomment-487958180, so when pysam's HTSlib is updated that problem will go away.

If you would like to provide an example file exhibiting the issue you're experiencing, we can investigate whether the problem is one of those two or a different problem still needing fixing.

zmunro commented 4 years ago

@jmarshall Here is a link to a fastqgz file that I have made publicly downloadable from my google drive. This is the file that I was having issues with. Here is the code I was running on it:

num_reads = 0
with pysam.FastxFile(path_to_fastqgz) as fastqgz_file:
    for entry in fastqgz_file:
        num_reads +=1
print("num reads: " + str(num_reads))

And the output is 22 with the fastqgz file I provided. However if you decompress the fastqgz file and instead run the code on a fastq, the output is 154470. I am using pysam==0.15.3, and python3.7.4

bioinformed commented 4 years ago

We are working on a (large) update to pysam to support htslib 1.10.x. There is no specific ETA yet.

-Kevin

On Tue, Dec 17, 2019 at 2:11 PM Zachary Munro notifications@github.com wrote:

@jmarshall https://github.com/jmarshall Here is a link https://drive.google.com/file/d/16TDoy9hibyYCSL7NH4Z9HNV9wsfLsq_g/view?usp=sharing to a fastqgz file that I have made publicly downloadable from my google drive. This is the file that I was having issues with. Here is the code I was running on it:

num_reads = 0 with pysam.FastxFile(path_to_fastqgz) as fastqgz_file: for entry in fastqgz_file: num_reads +=1 print("num reads: " + str(num_reads))

And the output is 22 with the fastqgz file I provided. However if you decompress the fastqgz file and instead run the code on a fastq, the output is 154470. I am using pysam==0.15.3, and python3.7.4

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pysam-developers/pysam/issues/738?email_source=notifications&email_token=AAJT5PYS7V2GORPXU5VIV5TQZEP5JA5CNFSM4GAWDWX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHDT5NQ#issuecomment-566705846, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJT5P7RE5QWWBPTSS5EIF3QZEP5JANCNFSM4GAWDWXQ .

jmarshall commented 4 years ago

@zmunro: I can confirm that your file is one that is fixed by samtools/htslib#744 so will work in an upcoming pysam release incorporating the recent HTSlib release.

jmarshall commented 4 years ago

This has been fixed since pysam 0.16.0 (when used with HTSlib 1.10 or later).