Open jmarshall opened 3 years ago
I'm not sure the quote is necessarily correct. It's the case that the zero length CIGARs weren't being used to update the bin linear index so the start..end ranges could be invalid for a zero-length read. However personally I view that as a bug and as far as bins go I believe it was working. Completely agree that the bin field should match the bin it was placed into for purposes of constructing the bin index during indexing.
I'm not sure what's missing though from the spec. It looks to give a reasonable description of how to use bins to answer queries. 5.1.1 answers that succintly as far as bins go and people could stop at that point and still get a working query (I think).
Where it gets fuzzy is linear indexes in 5.1.3. I'd like a better picture than the one used, showing bins, sequences, ranges covered and the linear index. Something similar to figure 1 in https://academic.oup.com/bioinformatics/article/26/14/1699/178142, but with linear index and multiple chunks instead of 1 as the 8-way tree gives holes between successive start..end ranges.
I have some suboptimal Tikz to produce this type of plot:
I'm thinking maybe it needs to be less cluttered, or perhaps one like this and then some zoomed up spy-glass style elements in which we can discuss the specifics. That said, it also needs some markers for genomic coordinates. Things that can be demonstrated visually would be:
Location of reads into bins, showing long reads end up in bigger bins or short reads if overlapping boundaries.
Linear index, obviously.
Highlighted bins that come back from an X-to-Y query (shading).
Chunks, how many and where they start / end. How we'd use them from an indexing perspective to ignore bins that overlap but don't have data covering the specific query.
Omission of empty bins from the index.
I finally found the source for this pic. I hunted high and low and then remembered it was on my home PC as viewing the PDF was much quicker there. It's pretty ghastly and not the ideal way to write it in tikz I'm sure, but here it is (ignore the header - it's just a cut down copy of the SAM spec to get boilerplate stuff in):
https://github.com/jkbonfield/hts-specs/blob/tikz_pic/pic.tex#L48-L141
As noted on PR #521, the specification does not fully describe how to use an index to check whether a particular read intersects with a query interval. In particular,
Originally posted in https://github.com/samtools/hts-specs/pull/521#issuecomment-677744051