pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
786 stars 273 forks source link

tabix_index silently fails on GZ compressed files #760

Open lecardozo opened 5 years ago

lecardozo commented 5 years ago

Hi,

I tried to create a tabix index from a Gencode compressed GFF3 file, specifically this one:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gff3.gz

but it failed silently. After some investigation and some tweaking on the htslib source, I found the problem was that this file was GZ-compressed instead of BGZ-compressed, causing the tbx_index_build3 function to return prematurely here: https://github.com/pysam-developers/pysam/blob/e69a124140da5d10f1da7dd8095bea8dc46bff1a/htslib/tbx.c#L273

I think the slightest change that would improve the experience here is just telling the user what is the problem:

 if ( bgzf_compression(fp) != bgzf ) {
+   hts_log_error("Your file is not BGZF-compressed.")
    bgzf_close(fp); 
    return -1;
 } 

A more "sophisticated" solution would be checking the type of compression and automatically converting from GZ to BGZ in the tabix_index function. I implemented this functionality here and would be willing to submit a PR, if desired. :smile:

Thanks!

raivivek commented 5 years ago

Neat -- Do you think you might propose this on the htslib repository as this change directly relates to the underlying library?

jmarshall commented 5 years ago

If that if statement was triggered, then it shouldn't have failed silently because the OSError should have been raised:

https://github.com/pysam-developers/pysam/blob/e69a124140da5d10f1da7dd8095bea8dc46bff1a/pysam/libctabix.pyx#L1034-L1035

If there's changes to be made, they're probably in tabix_index(), which makes this a pysam change.

lecardozo commented 5 years ago

Sorry, I think I wasn't clear on that. The OSError is raised, as expected, but the error message is generic, even though the real cause behind this error is known by the tbx_index_build3() function (input file was gzipped instead of bgzipped).

The htslib tabix documentation is clear on the fact that it expects a sorted .bgz file, but the error could be more informative than tbx_index_build_failed: filename.

Should I transfer this discussion to the htslib repository?

jmarshall commented 5 years ago

Other error paths out of tbx_index_build3() set errno, so this one ought to too. That would be an issue to raise against htslib.

But the Python tabix_index() function does lots of work up front before calling tbx_index_build2 — including, it appears, using hts_get_format() [1] to detect the file's exact format. So it could easily check gz vs bgzf up front and output a good error message itself as appropriate. Since it's doing all that already, it seems best to do a little more there and fix this in the Python layer.

[1] It should be using hts_get_format() rather than fp.format.format, which is illegitimate access to htslib's internals and may fail in future.

lecardozo commented 5 years ago

Other error paths out of tbx_index_build3() set errno, so this one ought to too. That would be an issue to raise against htslib.

Yes! I'll open an issue on htslib w.r.t. that.

Since it's doing all that already, it seems best to do a little more there and fix this in the Python layer.

In this branch I added all the handling in the Python/Cython layer, with the addition of automatic conversion from gz to bgz (which I really don't know if is a desired feature or not).