Closed davidtsao closed 4 years ago
Hi, I guess the issue is that the the gzip'ed file is actually a concatenation of multiple gzip files? This is no problem for gzip, but bgzip and pysam will stop at the end of the first component thinking that the file is complete.
The fastq.gz files I'm working with came directly from Illumina's fastq generation process, so unfortunately I have no control over it.
My current workarounds are to either 1) not use pysam for iterating over fastq records or 2) re-compress the fastq.gz file (e.g. gunzip example.fastq.gz && gzip example.fastq
)
Perhaps there could be a way to log a warning when this occurs, or a note in the documentation? It led to a hard to trace bug in my code. I'm happy to submit a pull request for a documentation change.
If you could provide a PR for docs, that would be great, thanks! I agree that this is a serious issue and leave it open - if I have time will look into this to see if the gzip iterator is recoverable from a bogus EOF.
Any follow-ups of this issue?
Without an example file, it is unclear whether this might be early stopping due to extra BGZF trailer blocks (samtools/htslib#45) or early stopping due to multiple GZIP members in a plain GZIP (non-bgzipped) file (samtools/htslib#742) or due to another reason.
The former has been fixed since HTSlib 1.4 so it is probably not that.
The latter has been fixed on HTSlib develop but after the 1.9 release. So if this problem is that, it will be fixed in an upcoming release.
I guess I'm in the second problem, looking forward to it. Thank you.
Is a fix on the way for this? I am still experiencing this issue.
There is now an HTSlib release containing the fix for the second problem noted in https://github.com/pysam-developers/pysam/issues/738#issuecomment-487958180, so when pysam's HTSlib is updated that problem will go away.
If you would like to provide an example file exhibiting the issue you're experiencing, we can investigate whether the problem is one of those two or a different problem still needing fixing.
@jmarshall Here is a link to a fastqgz file that I have made publicly downloadable from my google drive. This is the file that I was having issues with. Here is the code I was running on it:
num_reads = 0
with pysam.FastxFile(path_to_fastqgz) as fastqgz_file:
for entry in fastqgz_file:
num_reads +=1
print("num reads: " + str(num_reads))
And the output is 22 with the fastqgz file I provided. However if you decompress the fastqgz file and instead run the code on a fastq, the output is 154470. I am using pysam==0.15.3, and python3.7.4
We are working on a (large) update to pysam to support htslib 1.10.x. There is no specific ETA yet.
-Kevin
On Tue, Dec 17, 2019 at 2:11 PM Zachary Munro notifications@github.com wrote:
@jmarshall https://github.com/jmarshall Here is a link https://drive.google.com/file/d/16TDoy9hibyYCSL7NH4Z9HNV9wsfLsq_g/view?usp=sharing to a fastqgz file that I have made publicly downloadable from my google drive. This is the file that I was having issues with. Here is the code I was running on it:
num_reads = 0 with pysam.FastxFile(path_to_fastqgz) as fastqgz_file: for entry in fastqgz_file: num_reads +=1 print("num reads: " + str(num_reads))
And the output is 22 with the fastqgz file I provided. However if you decompress the fastqgz file and instead run the code on a fastq, the output is 154470. I am using pysam==0.15.3, and python3.7.4
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pysam-developers/pysam/issues/738?email_source=notifications&email_token=AAJT5PYS7V2GORPXU5VIV5TQZEP5JA5CNFSM4GAWDWX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHDT5NQ#issuecomment-566705846, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJT5P7RE5QWWBPTSS5EIF3QZEP5JANCNFSM4GAWDWXQ .
@zmunro: I can confirm that your file is one that is fixed by samtools/htslib#744 so will work in an upcoming pysam release incorporating the recent HTSlib release.
This has been fixed since pysam 0.16.0 (when used with HTSlib 1.10 or later).
I've noticed that using pysam.FastxFile to iterate over a bgzip'd fastq file only returns a subset of fastq records. The expected result is that all records are iterated over.
Example:
example.fastq.gz contains 1,009,470 reads.
When I iterate over the compressed fastq file using pysam, I only get 26,107 reads instead of the expected 1,009,470 reads.
I've traced the issue back to bgzip not decompressing all the contents of the fastq file.
For example, I get the expected ~4M lines using gunzip to decompress:
But using bgzip to decompress only returns ~104k lines, which is the same result obtained from pysam (26107 * 4 = 104428):