samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Exception is thrown when loading the BAM from a URL #1507

Closed SergejN closed 2 years ago

SergejN commented 4 years ago

Description of the issue:

Dear htsjdk team,

I encountered a strange bug when trying to display a BAM file in IGV. I work with the Mexican axolotl (genome size 32Gb). Therefore, I have to use CSI indices for the BAM files. In IGV, if I specify the URL to the BAM file and provide an explicit URL to the CSI index, an exception is thrown in

protected void verifyIndexMagicNumber(final String sourceName) {
        // Verify the magic number.
        seek(0);
        final byte[] buffer = new byte[4];
        readBytes(buffer);
        if (!Arrays.equals(buffer, BAMFileConstants.BAI_INDEX_MAGIC)) {
            throw new RuntimeIOException("Invalid file header in BAM index " + sourceName +
                    ": " + new String(buffer));
        }
    }

https://github.com/samtools/htsjdk/blob/edb8c602bbc20ff268e56282163f65911eabe379/src/main/java/htsjdk/samtools/AbstractBAMFileIndex.java#L372

I tried my best to check what is going on there. It seems that the path/url is correct in IGV itself. It is also correct in htsjdk, because the error message (see screenshot) has the correct path in it. image

I checked if the file is corrupted. If I download the files from the server and open them locally, everything works as expected. Next, I tested if the error occurs generally (suggesting a mis-configuration of the server) or is related to the CSI file. I extracted the reads from a locus chr10p:1-500000000 to make sure I can use the BAI index. With the BAI index everything works just fine. From the code above it seems that htsjdk tries to load the index as BAI even if it's a CSI index.

Your environment:

Steps to reproduce

Start IGV 2.8 and load the reference and the BAM files locally and through the URL. The reads will be displayed if the BAM file is loaded locally and fail to be displayed if the file is fetched from the URL. Since the path seems to be correct in the error message, I guess the error occurs within htsjdk. However, to be on the safe side, I add the link to the IGV issue (https://github.com/igvteam/igv/issues/843) I opened yesterday and also add @jrobinso to this issue. Unfortunately, I cannot provide a path to our server since it is not accessible from the internet. But I put the relevant data to Dropbox. https://www.dropbox.com/sh/dsgj4fe0gpej4jm/AADWmPPzQnOGkSnL6SUx3Ceja?dl=0

Important note: yesterday, when I tried to get it to run, I checked if it worked if I'd rename the CSI file to BAI (it does with the UCSC browser, though, therefore, it wasn't a complete stupid assumption). It did not. However, when I created the test dataset for IGV bug report, I forgot to rename the file back to CSI. So, what you have in the archive is a CSI file with the wrong extension. It's not the reason for the bug reported here.

Expected behaviour

CSI index should be loaded correctly

Actual behaviour

htskdk interprets the index as BAI

thank you! Best, Sergej

jrobinso commented 4 years ago

I've confirmed this and am working on a simple test case. Here is how IGV instantiates reader. It works if the bam and csi are local, fails if loading by URL

        final SamReaderFactory factory = SamReaderFactory.makeDefault().
                referenceSource(new IGVReferenceSource()).
                validationStringency(ValidationStringency.SILENT);
        SamInputResource resource = SamInputResource.of(IGVSeekableStreamFactory.getInstance().getStreamFor(url));
        SeekableStream indexStream = IGVSeekableStreamFactory.getInstance().getStreamFor(HttpUtils.createURL(indexPath));
        resource = resource.index(indexStream);
        return factory.open(resource);

I've put test data here

https://s3.amazonaws.com/igv.org.test/csi_test/chr10p.bam
https://s3.amazonaws.com/igv.org.test/csi_test/chr10p.bam.csi
lbergelson commented 4 years ago

@SergejN @jrobinso Thank you for reporting this and for the help reproducing it. I'm not surprised that there is a bad interaction somewhere with URLs and CSI. Both are sort of specially handled and there must be a bad interaction, probably in the SamFactory open method. I suspect it will be easy enough to track down now that you found the problem.

lbergelson commented 4 years ago

Ok. I found the problem. It's because it's loading the index as a stream in this case. We never updated that path to check if it's bai or csi explicitly and it just guesses that it's probably bai.

You can see the problem here:

   if (samIndex == null) {
                mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary())
                        : new DiskBasedBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary());
            } else if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping)
                            : new DiskBasedBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping);
            } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexFile, mEnableIndexMemoryMapping, getFileHeader().getSequenceDictionary());
            } else {
                throw new SAMFormatException("Unsupported BAM index file: " + mIndexFile.getName());
            }
SergejN commented 4 years ago

great job @lbergelson ! I hope it's easy to fix

jrobinso commented 4 years ago

@lbergelson I've confirmed that cram indexes (.crai) work by URL. Not that it was a question, but thought it worth checking.

There is a related potential problem elsewhere where the file extension is used to determine index type. Actually not even the extension necessarily but the end of the string (urls can have parameters). This is problematic for certain URLs that are basically mangled hashes, such as Amazon signed URLs and Google Drive urls. Perhaps we can solve both problems by reading the first few bytes of an index to check its magic number. I can look into that if you want to go that direction. Alternatively, and perhaps less disruptively, we could add a parameter to tell the htsjdk explicitly what the index type is.

whaleberg commented 4 years ago

In this case we can scan the first bytes and determine if it's csi/bai. That's probably what I'll do here. In the future I want to let you specify the type explicitly but that's a longer term change I think.

jrobinso commented 3 years ago

@whaleberg @lbergelson @SergejN Just checking, where are we with this?

jrobinso commented 3 years ago

bump

SergejN commented 3 years ago

any news?

tougai commented 3 years ago

Hello, i am having the same issue and interested on its resolution as well, any idea when this will be solved ?

lbergelson commented 3 years ago

Hi, wanted to update. I intend to patch this but I haven't gotten a chance. I've been running at much reduced capacity lately due to the pandemic and working at home. Sorry that it's taking so long!

tougai commented 3 years ago

Any updates ?

FredericBGA commented 3 years ago

Hello Any news regarding this issue? The last version on IGV seems to be still unable to open a BAM and its .csi index from an URL. https://github.com/igvteam/igv/issues/843 Thank you.

jrobinso commented 3 years ago

@lbergelson @whaleberg Can I suggest something like this as a solution (in BAMFileReader)? The idea is to peek at the stream to see if its gzipped, if it is guess "CSI", which is probably correct. A more rigorous test would check the CSI magic number, but to do that you first have to see if its gzipped, and gunzip the first block. I think it would suffice, and be a big improvement, to just check for gzip and return "CSI" if it is. We aren't validating BAI magic numbers here.

Relying on file index from URLs is problematic, and will for sure break for certain providers (e.g. Google Drive and potentially signed Amazon URLs).


    public SamIndexes getIndexType() {
        if (mIndexFile != null) {
             ...
        } 
           else if(mIndexStream != null) {
                  // Read the first 2 bytes from mIndexStream to see if its gzipped, if it is its probably CSI (and is definitely not BAI).
        }

        return null;
    }
jrobinso commented 3 years ago

@lbergelson @whaleberg Actually just determining type is not enough, getIndex() assumes there is an mIndexFile.

Another proposal, provide a constructor or factory function that I can pass a BAMIndex to, and IGV will deal with determining the type and reading the index. Perhaps this exists, but I don't see it. I'm currently using SamInputResource methods to create readers in this class https://github.com/igvteam/igv/blob/master/src/main/java/org/broad/igv/sam/reader/SamReaderPool.java

jrobinso commented 3 years ago

@lbergelson @whaleberg OK, 1 last attempt, see below. If not a file (mIndexFile == null) I do the magic bytes test on the stream. Actually for CSI I just check if its gzipped, so its a guess but a good one. Then when creating the index I also check if its a stream or file, in the current implementation it just assume that a type of "null" is a stream. This looks like it should work, but it fails later in CSIIndex checking the magic number, because it just reads the bytes from the supplied stream and does not gunzip (or bgunzip) first. Does that constructor assume a gzip stream? The supplied SeekableStream could be wrapped in a GUnzipStream but at this point I'm not sure how its supposed to work.

Sorry for the spam, hope this is helpful.


    /**
     * Retrieves the index for the given file type.  Ensure that the index is of the specified type.
     * @return An index of the given type.
     */
    @Override
    public BAMIndex getIndex() {
        if(!hasIndex())
            throw new SAMException("No index is available for this BAM file.");
        if(mIndex == null) {
            SamIndexes samIndex = getIndexType();
            if(mIndexFile != null) {
                if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping)
                            : new DiskBasedBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping);
                } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexFile, mEnableIndexMemoryMapping, getFileHeader().getSequenceDictionary());
                } else {
                    throw new SAMFormatException("Unsupported BAM index file: " + mIndexFile.getName());
                }
            } else if(mIndexStream != null) {
                if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = new CachingBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary());
                } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexStream,  getFileHeader().getSequenceDictionary());
                } else {
                    throw new SAMFormatException("Unsupported BAM index stream ");
                }
            }
        }

        return mIndex;
    }

    /**
     * Return the type of the BAM index, BAI or CSI.
     * @return one of {@link SamIndexes#BAI} or {@link SamIndexes#CSI} or null
     */
    public SamIndexes getIndexType() {
        if (mIndexFile != null) {
            if (mIndexFile.getName().toLowerCase().endsWith(FileExtensions.BAI_INDEX)) {
                return SamIndexes.BAI;
            } else if (mIndexFile.getName().toLowerCase().endsWith(FileExtensions.CSI)) {
                return SamIndexes.CSI;
            }

            throw new SAMFormatException("Unknown BAM index file type: " + mIndexFile.getName());
        }
        // NEW STUFF STARTS HERE
        else if(mIndexStream != null) {
            long pos = -1;
            try {
                pos = mIndexStream.position();

                byte [] magicBytes = new byte[4];
                mIndexStream.readFully(magicBytes);
                if(Arrays.equals(magicBytes, BAMFileConstants.BAI_INDEX_MAGIC)) {
                    return SamIndexes.BAI;
                }
                else if(magicBytes[0] ==  31 && magicBytes[1] == -117) {  // JUST CHECKING FOR GZIPPED,
                    return SamIndexes.CSI;
                } else {
                    throw new SAMFormatException("Unknown BAM index magic number: " + new String(magicBytes));
                }

            } catch (IOException e) {
                // TODO -- log ?
            }
            finally {
                try {
                    mIndexStream.seek(pos);   // reset stream
                } catch (IOException e) {

                }
            }
        }
        // NEW STUFF ENDS HERE

        return null;
    }
FredericBGA commented 3 years ago

thank you for all of this @jrobinso

cmnbroad commented 2 years ago

I have a fix that seems to be working. Needs some more tests. Also, I need to follow up since I think there might be some leaks in the CSI code - needs a little more investigation.