samtools / htslib

C library for high-throughput sequencing data formats
Other
810 stars 446 forks source link

[Clarification][Feature-request] Get chr:beginPos-endPos from byte-range #1437

Closed arfaoui47 closed 2 years ago

arfaoui47 commented 2 years ago

I'm wondering if it's possible to do a reverse lookup using the index file generated by tabix. I have the byte range from HTTP Request header and I want to deduce chr:beginPos-endPos from this range.

daviesrob commented 2 years ago

In theory you could do this by searching the index for the entries that match the byte range, but you wouldn't get a very accurate answer. As the index divides references into 4096 (or possibly larger for CSI) base bins, the start position in the byte range may point to data before the actual range requested. When tabix outputs the data, it simply discards the small number of unwanted records until it gets to the part that was asked for. Also, as the HTTP reader buffers up data before sending it on to be decompressed, it's very likely that it will have run past the actual file position where the range ended. If you asked for two regions that are fairly close together then the requests for them may also have been merged together, either because they start and end in the same BGZF block, or to avoid the overhead of opening a new HTTP connection.

So I think you could get an approximate answer, but not a very accurate one. If this is good enough depends on exactly what question you're trying to answer with the reverse lookup.

arfaoui47 commented 2 years ago

Thank you for the detailed explanation. Using the index to search for genome region until it matches the byte range wouldn't work in my case. I'm currently building an artificial VCF file with a variant each 100 base pair, then I'm indexing it using tabix. This file is used in IGV web; each time I select a genome region, I got the byte range send as in HTTP request header; I wanted to use this data to get the genome region (that's what I meant by reverse lookup); Since the index is a mapping from genome region (chr:beginPos-endPos) to byte-range in VCF file. I was wondering if there is a way to get the genome region from the byte-range. Maybe since the file is uniform, the mapping from genome region to byte-range is linear?

daviesrob commented 2 years ago

I think you'd have to try this experimentally - i.e. make a file, look at various locations in IGV and watch to find out which byte ranges are requested. I suspect the ranges for nearby locations will all start at the same file position, and it's quite possible that they may end at the same position too.

Note that IGV's behaviour may be quite different to tabix in terms of what it requests. Also, if it uses methods like prefetching and caching to speed up its interface, you may find that some selections don't trigger a request because it already has the data it needs.

jkbonfield commented 2 years ago

The index is an R-tree with a series of bins mapping genome coordinates to byte ranges. The format for the index is at the end of the SAM specification (https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf). You may be using the CSI variant instead of BAI, which has it's own (too brief!) specification on hts-specs, but it's a minimal change to the BAI one so that still helps you.

Basically there are bins that cover large ranges, with child bins convering successively smaller ranges. This means a few seeks and queries to use such an index. There is also the linear index too, which is an alternative way of jumping to a genomic starting location. It may prove more useful for you, but unlike the binning index it's not so good at handling objects of widely varying size (eg think of a SAM file with a long ONT read and a small Illumina read, with the former being in higher up bins).

A final note of warning, as far as I recall, the binning index isn't guaranteed to be sorted on disk! I vaguely recall this being an issue before, but the logic for omitting this escapes me. Tools generally load the entire index into memory so I guess it's never been a problem.