Open lecardozo opened 5 years ago
Neat -- Do you think you might propose this on the htslib repository as this change directly relates to the underlying library?
If that if
statement was triggered, then it shouldn't have failed silently because the OSError should have been raised:
If there's changes to be made, they're probably in tabix_index()
, which makes this a pysam change.
Sorry, I think I wasn't clear on that. The OSError
is raised, as expected, but the error message is generic, even though the real cause behind this error is known by the tbx_index_build3()
function (input file was gzipped instead of bgzipped).
The htslib tabix documentation is clear on the fact that it expects a sorted .bgz file, but the error could be more informative than tbx_index_build_failed: filename
.
Should I transfer this discussion to the htslib repository?
Other error paths out of tbx_index_build3()
set errno
, so this one ought to too. That would be an issue to raise against htslib.
But the Python tabix_index()
function does lots of work up front before calling tbx_index_build2
— including, it appears, using hts_get_format()
[1] to detect the file's exact format. So it could easily check gz vs bgzf up front and output a good error message itself as appropriate. Since it's doing all that already, it seems best to do a little more there and fix this in the Python layer.
[1] It should be using hts_get_format()
rather than fp.format.format
, which is illegitimate access to htslib's internals and may fail in future.
Other error paths out of
tbx_index_build3()
seterrno
, so this one ought to too. That would be an issue to raise against htslib.
Yes! I'll open an issue on htslib w.r.t. that.
Since it's doing all that already, it seems best to do a little more there and fix this in the Python layer.
In this branch I added all the handling in the Python/Cython layer, with the addition of automatic conversion from gz to bgz (which I really don't know if is a desired feature or not).
Hi,
I tried to create a tabix index from a Gencode compressed GFF3 file, specifically this one:
but it failed silently. After some investigation and some tweaking on the htslib source, I found the problem was that this file was GZ-compressed instead of BGZ-compressed, causing the
tbx_index_build3
function to return prematurely here: https://github.com/pysam-developers/pysam/blob/e69a124140da5d10f1da7dd8095bea8dc46bff1a/htslib/tbx.c#L273I think the slightest change that would improve the experience here is just telling the user what is the problem:
A more "sophisticated" solution would be checking the type of compression and automatically converting from GZ to BGZ in the
tabix_index
function. I implemented this functionality here and would be willing to submit a PR, if desired. :smile:Thanks!