samtools / htslib

C library for high-throughput sequencing data formats
Other
811 stars 447 forks source link

__hts_idx_t uncompressed and compressed offset getters #845

Open brainstorm opened 5 years ago

brainstorm commented 5 years ago

Hello htslib devs,

I'm starting to work on a htsget-server implementation that relies on htslib, based on preliminary work on https://github.com/samtools/hts-specs/pull/385. The first issue I've found is finding public getters for indexes, namely the (64 bit) compressed and uncompressed offsets. As @delagoya points out in his binary formats blogpost from 2016:

There are no tools that independently inspect this binary format and spit out the information into text, only tools (e.g. PICARD, samtools) that use the index for random access into BAM files. If I wanted that information I would either need to calculate the file offsets myself (as is done in the excellent htsnexus) or implement my own binary file parser for BAI files.

I have been writing some small C snippets to see if that's still the case today. Thanks @jmarshall for the handy htslib C scaffold, very useful to get started quickly. I'm not sure I can externally access hts_idx_t attributes I need via any of methods/getters currently available?

In other words, I just need an exported method to access the tuple of 64bit pointers pointing to the compressed and uncompressed block offsets. The only (not publicly) available ones seem to be:

    // $ nm ../htslib/hts.o | grep load
    // 000000000000aae0 T _hts_idx_load
    // 000000000000ab60 T _hts_idx_load2
    // 000000000000b550 t _hts_idx_load_core

But the compiler yells: hts_idx_t struct is: "pointer to incomplete class is not allowed" for the aforementioned struct, as expected.

Since I'm a newcomer to the codebase I'm probably missing some way to do it right, so I just wanted to ask first before adding my own functions to retrieve the block offsets and file a pullrequest that might not get merged.

Happy to hear about pointers (no pun intended) and advice, perhaps this is being discussed offline ATM since crypt4gh is ongoing and there seems to be some probably related ABI-breaking changes queued as well in this issue tracker?

Thanks in advance!

/cc @ohofmann @reisingerf

jmarshall commented 5 years ago

There is currently no HTSlib API exposing this level of detail for indexes. If you want to look at the raw offsets in the index, you get to read in and parse the index file yourself (the formats are described in the SAM and CRAM specifications).

HTSlib almost got some useful APIs for this stuff back when I was working on htsget in HTSlib. Changes like 5869a67c3c19aa9e2b3b356269dadaf9aff1c669 (as corrected by 23d7f17d809b3da59513c9a656e6c69f52175bb6) were made so that the chunks returned by htsget server code using hts_itr_t would be tightened up — previously they were somewhat open-ended as hts_itr_next() skips to the next block of offsets once the coordinates go past the query range anyway.

The exp/raw-index branch contains the proposed APIs that were the followup to this work. These were intended to be what you would need in order to implement an htsget server: hts_itr_next_chunk() for accessing the index's raw virtual offsets as filtered by a query region, and hts_send_chunk() for blasting the selected data back to a client.

Perhaps that branch should be dusted off and updated to handle multi-iterators too…

jmarshall commented 5 years ago

See also #385.

daviesrob commented 5 years ago

@jmarshall Yes, that looks like the sort of thing that would be needed. It should handle the multi-iterators too now, as they're essentially the same as the single ones. Cram support would be nice, though (should be easier in some ways as you can never get a partial block).

@brainstorm crypt4gh, in its current form, suffers from the same problem as BGZF in that the BGZF or CRAM block boundaries are unlikely to match up with the 64K crypt4gh blocks. Even worse, if you rewrite a partial BGZF block then you now have something that needs to be spliced into the crypt4gh block stream which is unlikely to be exactly 64Kbytes long - and you have to do this in a way that doesn't accidentally compromise the encryption of the file. This would likely need support for encrypting parts of the file with different keys, and adding instructions about how to splice the results together.

brainstorm commented 4 years ago

Thanks a ton for all the detailed explanations and references to the issues around voffsets and htsget+htslib, very useful stuff.

This topic has now come back to my attention and I'm ready to dedicate a focused effort for a limited time. So I went through issue #385, the https://github.com/samtools/htslib/compare/exp/raw-index, recent (including the newest 1.10 release) and aforementioned commits and I guess that today the case for reading index files to extract voffsets still goes through the seek & tell route over alignment (BAM) files? I'm mentioning this related to @jmarshall earlier comment:

There is currently no HTSlib API exposing this level of detail for indexes. If you want to look at the raw offsets in the index, you get to read in and parse the index file yourself (the formats are described in the SAM and CRAM specifications).

I've seen the new on-the-fly index construction feature... is this something you implemented to target use cases like htsget or aimed at other issues? Is it possible to access voffsets from those in-memory indexes?

My present design objective would be to leverage pre-existing .BAI/.CRAI indexes on disk to calculate voffsets from them, without having to re-scan .BAM files... is this something that you have contemplated before to include in htslib or are there any limitations with that approach I should know of?

At this point, would it make sense to get this functionality in htslib proper? I could refresh that branch as mentioned and extend from there if it stands a chance on getting merged.

Thanks a lot in advance!

/cc @ohofmann

jkbonfield commented 4 years ago

The indexing on-the-fly was my work and I did it primarily because it's something that users have asked for in the past and it simply makes sense. I hadn't specifically had any thoughts on exposing the internals or integration with anything specific such as htsget.

It's possible some of the multi-region iterator code could be leveraged to turn region queries into byte ranges in a more public manner as it has to make the assessment and judgement of when two sequence ranges correspond to the same or overlapping byte ranges. I think at present though these structures are opaque.