samtools / htslib

C library for high-throughput sequencing data formats
Other
813 stars 446 forks source link

API for low-level block offsets corresponding to queries #385

Open jeromekelleher opened 8 years ago

jeromekelleher commented 8 years ago

There is considerable interest in providing direct access to the compressed blocks within BAM/CRAM files corresponding to a given range query. It would be useful if htslib provided a low-level API to provide access to this information, to avoid the need for multiple implementations of the index processing code.

In psuedocode, the API would look something like:

bam_index_parser_t *ip = bam_index_parser_alloc(bam_index_filepath);
block_t *blocks;
size_t num_blocks;
err = index_parser_get_data_offsets(ip, reference_id, start, end, &num_blocks, &blocks);
for (j = 0; j < num_blocks; j++) {
    printf("block %d spans %d bytes from offset %d\n", j, blocks[j].size, blocks[j].offset);
}

Is this feasible? Or, is parsing the BAM index and deriving this information sufficiently simple that we don't need htslib to do it? If so, it would be great if someone could point me to a reference implementation!

lh3 commented 8 years ago

You can't get the number of blocks as this information is not available in the BAM index. It is technically possible to get the virual offsets in the first and last blocks that fully contain reads in a genomic region:

/**
 * @param idx   BAM index
 * @param tid   reference ID
 * @param start genomic start
 * @param end   genomic end
 * @param vs    start virtual offset (out)
 * @param ve    end virtual offset (out)
 */
int bam_voff_by_region(const bam_idx_t *idx, int tid, int start, int end, uint64_t *vs, uint64_t *ve);

The above should be relatively easy. The harder part is the next step: to decompress blocks containing and between vs and ve, and to iterate through them – unless BAM is block-aligned, we need to iterate through the first and the last blocks. Achieving this probably requires structural changes to BGZF and the BAM reader. This will be challenging.

jkbonfield commented 8 years ago

For GA4GH we dropped the requirement that data sent back on a range query only matched the data range requested. This means that for most short-read data it could be sufficient to simply identify the blocks matching the virtual start / end, potentially recode those if required, and then just shove out the rest ignorant of whether they contain reads outside of the requested range.

On mixed long and short read data it may be an issue - a single long read could cause, say, vs to be block#100 and ve to be block#200, but perhaps it's just block 100 and blocks 150-200 that actually span the region requested. Maybe this isn't a big issue though if we accept the client end needs to filter.

Ie what you need from an API is basically the virtual start / end locations and nothing more. The actual file I/O can be done yourself by normal open/seek/read loops.

I think in theory there are improvements that could be made to query the list of virtual offsets for a particular coordinate and determine which block locations may be permissable to omit, but perhaps it's not necessary in the first implementation.

jeromekelleher commented 8 years ago

Thanks @lh3 and @jkbonfield, this is very helpful. I agree that we don't need to filter out reads that don't intersect with the query interval, we just have to guarantee that we return all reads that do intersect with it. (Of course we could trivially do this by just returning the whole file each time, but that wouldn't be terrible useful!).

Just getting back the start and end offsets within the file would be great (do virtual offsets refer to byte offsets within the file?). For my application I definitely don't want the API to do I/O. The returned offsets will be encoded in JSON, and handed back to the client. The client will then make HTTP connections and pull down the blocks that they want.

jkbonfield commented 8 years ago

On Tue, Jun 07, 2016 at 12:16:33AM -0700, Jerome Kelleher wrote:

Just getting back the start and end offsets within the file would be great (do virtual offsets refer to byte offsets within the file?). For my application I definitely don't want the API to do I/O. The returned offsets will be encoded in JSON, and handed back to the client. The client will then make HTTP connections and pull down the blocks that they want.

Virtual offsets correspond to physical file offset of the start of the compressed block along with the relative offset within the uncompressed block (hence virtual). You cannot tell if this block starts on a read boundary though, but you do know the virtual offset is a read boundary so you'll simply have to slice the uncompressed block up at that point and recompress.

I'll also correct my previous comment as I fear it was too simple. I wish I hadn't added my thoughts now! It seemed like a good idea at the time :-)

The index is essentially an R-tree. It has a root bin per chromosome and then successive child bins that span regions entirely held within their parent bin. The index simply maps those prescribed bin locations to virtual offsets. Therefore there are a list of bins, potentially far apart, that need to be checked for data covering a specific location.

However... there is also the linear index which is just the virtual offset of the first read that start in that window. (Note "start in", not overlap). This almost covers what you need, but not quite.

Sorry I didn't think that through before. I've no idea how this works with the API either as I haven't personally used it.

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

jeromekelleher commented 8 years ago

Virtual offsets correspond to physical file offset of the start of the compressed block along with the relative offset within the uncompressed block (hence virtual). You cannot tell if this block starts on a read boundary though, but you do know the virtual offset is a read boundary so you'll simply have to slice the uncompressed block up at that point and recompress.

Ah, that makes sense. So, if I just dump out the compressed block starting at the returned physical file offset, I won't get a well-formed BAM subset. Is this true for all BAMs? I ask this because I have a vague recollection of someone mentioning that having BAM records spanning a compressed block only happened in older BAM files. If this was true (and we could reliably detect it) then we could possibly just work around the problem by the (initial at least) API not working for unsupported BAMs.

Probably just wishful thinking on my part though!

jkbonfield commented 8 years ago

On Tue, Jun 07, 2016 at 02:51:09AM -0700, Jerome Kelleher wrote:

Ah, that makes sense. So, if I just dump out the compressed block starting at the returned physical file offset, I won't get a well-formed BAM subset. Is this true for all BAMs? I ask this because I have a vague recollection of someone mentioning that having BAM records spanning a compressed block only happened in older BAM files. If this was true (and we could reliably detect it) then we could possibly just work around the problem by the (initial at least) API not working for unsupported BAMs.

At some point (quite a while ago) samtools was modified so that if a read doesn't fit in the remainder of the current block it will start a new one. (That doesn't mean all blocks start on a read as a read may be large enough to span more than one block.) Unfortunately there isn't a BGZF header field to distinguish these blocks.

This isn't a requirement of the specification and there are obviously other tools out there producing BAM too. I've no idea what other tools do (not even my own come to think of it!).

So it's perhaps an optimisation that you can exploit, but cannot rely on.

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

jeromekelleher commented 8 years ago

At some point (quite a while ago) samtools was modified so that if a read doesn't fit in the remainder of the current block it will start a new one. (That doesn't mean all blocks start on a read as a read may be large enough to span more than one block.) Unfortunately there isn't a BGZF header field to distinguish these blocks.

Ah, I see.... Do you happen know (or think it's likely) if the 1000G BAMs have this property? All I need for now is a proof of concept, and running off the 1KG BAMs would do perfectly well to illustrate scalability. I'd be perfectly happy to hack together something quick that relies on (a) the never version of samtools being used, and (b) the reads being short enough that they always fit into a single block. Do such BAMs exist in the wild do you think?

jkbonfield commented 8 years ago

On Tue, Jun 07, 2016 at 03:32:05AM -0700, Jerome Kelleher wrote:

Ah, I see.... Do you happen know (or think it's likely) if the 1000G BAMs have this property? All I need for now is a proof of concept, and running off the 1KG BAMs would do perfectly well to illustrate scalability. I'd be perfectly happy to hack together something quick that relies on (a) the never version of samtools being used, and (b) the reads being short enough that they always fit into a single block. Do such BAMs exist in the wild do you think?

I'm afraid I've no idea about 1000 genomes files. I think you'd need to just try it and see. Most likely a little bit of manual dumping of the .bai coupled with some targetted dd statements to stitch a header onto a BAM block (repeated with several different blocks) would answer that.

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

jmarshall commented 8 years ago

Please have a look at the exp/raw-index branch, which as a first cut at this adds

int hts_itr_get_offsets(const hts_itr_t *iter, uint64_t *beg_off, uint64_t *end_off);

which returns the smallest and largest virtual offsets that an iterator looks at when iterating over a given range. (It turns out that this data is already stored in the hts_itr_t, alas sorry should have looked at that in detail a lot sooner…)

If you right-shift those by 16 bits, you'll get the physical file offsets of the starts of the first and last touched BGZF blocks.

jeromekelleher commented 8 years ago

Thanks @jmarshall, this is really useful. I've wired this up to pysam now, and it seems to be working. I'll see if I can create a server with this and report back.

jmarshall commented 8 years ago

Great, but also oh crap — I'm about to push an expansion of this which instead adds

int hts_itr_next_chunk(hts_itr_t *iter, uint64_t *beg_off, uint64_t *end_off);

to be called in a loop (while (hts_itr_next_chunk(…) >= 0)) that returns a (short) series of much tighter chunks. That hts_itr_get_offsets() turned out to be way too conservative, returning at least 2.7Mb no matter how small a region I asked for.

jeromekelleher commented 8 years ago

No no, this sounds great; go for it! I've only just figured out the mysteries of linking Cython up to C code, I have no real code at all written in terms using the function. A list of chunks sounds perfect.

bioinformed commented 8 years ago

As always, please let me know if you need any help with Pysam (python or Cython).

On Fri, Jul 1, 2016 at 8:20 AM, Jerome Kelleher notifications@github.com wrote:

No no, this sounds great; go for it! I've only just figured out the mysteries of linking Cython up to C code, I have no real code at all written in terms using the function. A list of chunks sounds perfect.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/htslib/issues/385#issuecomment-229959307, or mute the thread https://github.com/notifications/unsubscribe/ABM-v7h2RI8USfj0UFLo9MP0JkOirIy6ks5qRSIrgaJpZM4Iu3LF .

jmarshall commented 8 years ago

Okay, I've updated exp/raw-index to provide this hts_itr_next_chunk() function instead.

jeromekelleher commented 8 years ago

Thanks for this @jmarshall. I'm a bit confused as to how to interpret these new values though. It seems that regardless of my input, I get back a list of pairs where all but the first are equal. For example, if I use /NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam, and do this:

for start, end in zip(*f.get_file_offsets("3", 0, 100000)):
    print(start, end, end - start, sep="\t") 

I get

2531056585      2531375585      319000
2531473896      2531473896      0
2532164543      2532164543      0
2532909512      2532909512      0
2533640041      2533640041      0
2534455495      2534455495      0
2535229907      2535229907      0
2535945147      2535945147      0
2536680845      2536680845      0
2542493846      2542493846      0
2548321487      2548321487      0
2554228357      2554228357      0
2560029705      2560029705      0
2565948275      2565948275      0
2571781523      2571781523      0
2577546905      2577546905      0
2621705404      2621705404      0
2668741492      2668741492      0
2714770668      2714770668      0
2760604126      2760604126      0
2804346499      2804346499      0
2848910754      2848910754      0
2894241829      2894241829      0
3247452819      3247452819      0

When I try different coordinates, I get the same pattern: only the first entry in the array seems to make sense.

Thanks for the offer of help @bioinformed! It would be really useful if you could have a quick look at my code here to see if I'm doing anything daft:

https://github.com/pysam-developers/pysam/compare/master...jeromekelleher:jk-experiment

jmarshall commented 8 years ago

I was over-summarising when I said to shift them right by 16 bits to see the raw file offsets.

Instead print them as raw-file-offset-of-start-of-block:byte-offset-within-block with printf("%lu:%lu", off >> 16, off & 0xFFFF) or so, and you'll see that the first chunk has several blocks in it, but the subsequent chunks are each just a read or two within a single BGZF block.

I am in the process of adding another function to this branch that should let your Python layer be something like

while hts_itr_next_chunk(iterator, &beg_off, &end_off) >= 0:
    hts_send_chunk(outfp, infp, beg_off, end_off)
bioinformed commented 8 years ago

The only obvious glitch I see is that you don't destroy the iterator before existing the function.

On Thu, Jul 7, 2016 at 7:50 AM, Jerome Kelleher notifications@github.com wrote:

Thanks for this @jmarshall https://github.com/jmarshall. I'm a bit confused as to how to interpret these new values though. It seems that regardless of my input, I get back a list of pairs where all but the first are equal. For example, if I use /NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam, and do this:

for start, end in zip(*f.get_file_offsets("3", 0, 100000)): print(start, end, end - start, sep="\t")

I get

2531056585 2531375585 319000 2531473896 2531473896 02532164543 2532164543 02532909512 2532909512 02533640041 2533640041 02534455495 2534455495 0 2535229907 2535229907 0 2535945147 2535945147 0 2536680845 2536680845 0 2542493846 2542493846 0 2548321487 2548321487 0 2554228357 2554228357 0 2560029705 2560029705 0 2565948275 2565948275 0 2571781523 2571781523 0 2577546905 2577546905 0 2621705404 2621705404 0 2668741492 2668741492 0 2714770668 2714770668 0 2760604126 2760604126 0 2804346499 2804346499 0 2848910754 2848910754 0 2894241829 2894241829 0 3247452819 3247452819 0

When I try different coordinates, I get the same pattern: only the first entry in the array seems to make sense.

Thanks for the offer of help @bioinformed https://github.com/bioinformed! It would be really useful if you could have a quick look at my code here to see if I'm doing anything daft:

pysam-developers/pysam@master...jeromekelleher:jk-experiment https://github.com/pysam-developers/pysam/compare/master...jeromekelleher:jk-experiment

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/htslib/issues/385#issuecomment-231083250, or mute the thread https://github.com/notifications/unsubscribe/ABM-v3mjhP9OW8kS-MzTSg5aC866P2Qnks5qTQRAgaJpZM4Iu3LF .

jeromekelleher commented 8 years ago

Ah OK, thanks for the clarification @jmarshall. The new function sounds very useful.

Thanks for looking at the code @bioinformed, this is very helpful. I'll sort out that leak.

jeromekelleher commented 8 years ago

As it turns out, I was able to get the information that I need by using a combination of pysam's fetch() and tell() methods. I have no real need for this low-level API now, so feel free to close this issue @jmarshall if you think this isn't of more general interest. Sorry for taking up your time, and thanks very much for the help --- I wouldn't have thought of the method I have if I wasn't trying to figure out the offsets that your method returned!