samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

Explanation of construction of binning and linear indexes #1585

Closed marianogabitto closed 1 year ago

marianogabitto commented 1 year ago

Hi All, I was looking for an explanation of how the binning indexes are built. I read the Tabix paper and I just don't understand it. Is there any place in which I can read how this is constructed? Can you also point me to the section of the code in which indexes are built?

Thanks, Mariano

daviesrob commented 1 year ago

The tabix index format can be found here, but you're best off reading it in conjunction with the description of how the very similar BAI index works in the SAM specification as it gives rather more detail. The main difference between the two is that the TBI index includes a bit of information about the file that was indexed.

The main functions involved in constructing the index are hts_idx_init() to set everything up; hts_idx_push() which is called for every record you want to add to the index; and hts_idx_finish() which adds the last entries and does a bit of optimisation.

Most of the work happens in hts_idx_push(). It's called with the tid, beg and end for the current record along with the virtual file offset of the end of the current record. Its job is to work out which R-tree bin corresponds to the range [beg..end) using hts_reg2bin(), and then call insert_to_l() and insert_to_b() to update the file offsets in the linear and binning (R-tree) indexes respectively. insert_to_l() gets called for each record and simply records the offset of the start of the record if it's the first one seen in a given linear index slot. insert_to_b() is a bit more complicated as its only actually called when the R-tree bin changes from the last one seen, and the data added is that for the previous bin. That trick allows hts_idx_push() to accumulate adjacent records into a single chunk in the given R-tree bin.

If hts_idx_push() discovers that the tid has changed it does a bit more work. It grows the data structures so that it can store the entries for the new reference; and also stores some metadata in a pseudo-bin as described in section 5.2 of the SAM specification.

After the last record, hts_idx_finish() needs to be called so that the final offsets can be added to the binning index, and the last pseudo-bins updated. It also fills in holes in the linear index and calls compress_binning() to optimise the binning index. The optimisation isn't strictly necessary, but is done to reduce the number of bins that reference very small file regions by pushing them up into parent bins.

Finally, hts_idx_save_as() converts the internal data structures to the on-disk format described in the specification and writes it out.

If you want to see the contents of an index file, you can do it by building a copy of HTSlib with the DEBUG_INDEX macro defined, e.g. like this:

make clean; make CC='cc -DDEBUG_INDEX'

Then any program that loads or saves an index (e.g. tabix) will call the idx_dump() function to write the contents of the index (be warned that this could be a lot of output for big indexes!) You could also try @cmdcolin's BAM index visualizer although note that it only supports BAI indexes, so won't understand the ones made by tabix.