samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Flaw in tribble tabix test #1398

Closed jrobinso closed 5 years ago

jrobinso commented 5 years ago

There is a flaw in the following logic in htsjdk.tribble.AbstractFeatureReader for tabix indexed files loaded by URL. I suspect I'm the culprit here, but am posting in case anyone has an idea for a fix. Also I would like to get an opinion on the proposed fix from @lbergelson and others who work with Tribble code. Apologies for the length of the post.

    public static boolean isTabix(String resourcePath, String indexPath) throws IOException {
        if (indexPath == null) {
            indexPath = ParsingUtils.appendToPath(resourcePath, ".tbi");
        }

        return IOUtil.hasBlockCompressedExtension(resourcePath) && 
        ParsingUtils.resourceExists(indexPath);
    }

The purpose of this test is to determine if the file represented by "resourcePath" is a tabix indexed file. The test has 2 parts, first it checks for a ".gz" extension, then it checks to see if a tabix index exists. The second test is performed by appending ".tbi" to the path and checking for existence. This is flawed for 2 reasons: (1) in general cloud URLs cannot be constructed from such a simple algorithm, so the existence test will fail even if there is a tabix indexed file, (2) for Dropbox URLs the filename shown is irrelevant, it does not identify the file, so this test will always pass even if there is no tabix index. The second failure mode is more serious.

By way of illustration, these urls return the same file, the fle identifier is 4fk2wk0eulam1j0, everything after the last slash is irrelevant.

https://www.dropbox.com/s/4fk2wk0eulam1j0/genes.gtf.gz?dl=0
https://www.dropbox.com/s/4fk2wk0eulam1j0/foo.bar?dl=0

There is no easy fix unless someone has a means of testing for "tabix-ness" without relying on an index URL. I think the best thing to do here is to remove this feature of tribble, and treat a remote file as tabix indexed only when an index file is explicitly defined. We can keep the tabix inference convenience for local files. If this strategy is agreeable I will work on it and submit a PR.

jrobinso commented 5 years ago

I think the simplest fix for this is to simply not try to guess the indexPath. Clients who want to use a tabix index should supply the index path. So the method above becomes

    public static boolean isTabix(String resourcePath, String indexPath) throws IOException {
        return indexPath != null &&  IOUtil.hasBlockCompressedExtension(resourcePath) && ParsingUtils.resourceExists(indexPath);
    }
lbergelson commented 5 years ago

@jrobinso Thanks for bringing this up. We've had similar issues to the first one with files in the cloud. Going forward I think all discovery of indexes / companion files is something that should be the responsibility of the client code. I'm not sure it's a change we can make in general to the existing API though without a lot of disruption. I'm definitely open to changing the behavior for the specific case of http urls though as you suggest.

About the first error mode. Are you having issues with files being discounted because they don't end in .gz? We could open a stream and detect if it looks like a bgzip file and recover some false negatives that way. It's a bit expensive but probably not a real bottleneck in most workflows. I'm not sure about the case where it can't identify companion though, I think guessing and failing is the best we can do there if no index path is provided. We could change error messages to make it clear that it might be failing because the index can't be found and not because the file isn't tabix indexed if that would make things clearer.

The second error mode is definitely more insidious. I'm assuming that's the one that's practically giving you trouble? I'm not sure how to avoid it easily. We could of course just try decoding the file at the url and see if we can read it as a tabix file / index pair, but that seems very expensive. Is there any standard http request to ask if two urls refer to the same underlying resource or to ask for a canonicalized unique URL? I haven't heard of anything like that so I'm assuming not.

The idea of disabling the check for specific domains seems practical, but very hacky as you say. At the very least we'd need to make the list of domains configurable by the users and not hardcoded into htsjdk. Maybe an additional entry in Defaults could handle it. I think that is probably better than breaking it for users who do have filesystem like URL schemes.

lbergelson commented 5 years ago

Sorry, you posted between me starting to write something and finishing it... I'm a little wary about disabling it for non-http urls since I think most downstream code in picard/gatk relies on htsjdk to discover the index paths for it. I suspect a lot of tools will break in a fairly significant way if we change it to no longer guess the index path. That said, it would be good to migrate in that direction, so maybe there's a way we could get there incrementally and safely through introduction of newer methods and deprecation of the existing ones.

jrobinso commented 5 years ago

@lbergelson Thanks for the fast response.

I totally understand how changes to this behavior could be disruptive. I think the current behavior should be kept for local files for sure, so something like if(indexPath == null && isLocal(resourcePath)...

Another option for keeping the current behavior would be to add some public static method on AbtractFeatureReader which sets a "GUESS_INDEX" flag, defaulting to true.

    public static boolean isTabix(String resourcePath, String indexPath) throws IOException {
        if (indexPath == null && (isLocal(resourcePath) || GUESS_INDEX)) {
            indexPath = ParsingUtils.appendToPath(resourcePath, ".tbi");
        }

        return indexPath != null &&  IOUtil.hasBlockCompressedExtension(resourcePath) && 
        ParsingUtils.resourceExists(indexPath);
    }

For Dropbox URLs the current behavior will break for anyone relying on it, and in a nasty way that is hard to understand, the code will read the .gz file as a tabix index. For Google Drive and signed Amazon URLs the code will fail in the opposite way, a potential false negative, and the file will be treated as a non-tabix file even if an index exists. However this is not so serious a bug as the indexPath guessing is a "convenience", clients really should be passing it explicitly.

lbergelson commented 5 years ago

@jrobinso Sorry for the delay, I was out with a sick kid. I looked and was surprised to see that there already IS a mechanism for changing this behavior baked in. See AbstractFeatureReader.ComponentMethods and AbstractFeatureReader.setComponentMethods. It looks like @jsilter added it years ago. Would that work well for you or do think we should make a global change to htsjdk?

jrobinso commented 5 years ago

Ahh, clever, although a bit confusing. Yes I can use that for IGV. RE global change, I guess that depends on whether you care about Dropbox, but I imagine you have higher priority issues.

I'll close this, thanks again for the quick response.