samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
641 stars 174 forks source link

{tabix,csi}: specs are incomplete #70

Open kortschak opened 9 years ago

kortschak commented 9 years ago

The specifications for tabix and csi are incomplete, currently being variously only essentially a struct layout with no semantic explanation or restrictions, or a description of the underlying storage format. Files in the wild in CSI and tabix (produced by samtools) include data that is not described in the specification at all.

kortschak commented 9 years ago

ping @lh3?

bicycle1885 commented 8 years ago

No updates here? I often encounter the same problem due to ambiguous descriptions in the specs.

brentp commented 7 years ago

Is there any update on this?

The indexer in htslib for CSI stores a tbx_conf_t and all the sequence names in the free-form meta/auxiliary field, but that behavior is completely undocumented.

kortschak commented 6 years ago

An interesting issue arises here; the SAM spec states that the BAI is 6 levels deep, but the code in htslib and a comment here indicates that internally it is considered 5 levels deep (and the default depth for CSI is 5 - presumably to match BAI's tree). This means that BAI and CSI do not agree on the meaning of depth. This is sort of why this kind of documentation is important, for actually doing science.

lbergelson commented 6 years ago

@jkbonfield @yfarjoun @jmarshall I've run into this same problem of underspecified CSI when reviewing an htsjdk pull request to add support for CSI indexes. (See https://github.com/samtools/htsjdk/pull/1040. )

Particularly, the details of the dummy bin in CSI are completely missing from the spec.

jkbonfield commented 6 years ago

I'll have to dig into the code to figure out what goes in the dummy bin, and also to see how the levels compare. CSI isn't something I've dealt with, although I'll be exploring it soon in other work.

This is really the realm of @pd3 though as I think he worked more on CSI.

jmarshall commented 6 years ago

The dummy bin contents are the same as in BAI and calculating its bin number is the natural generalisation from the BAI calculation, and came up on samtools-help last December.

Over the last few weeks I've been dusting off my old draft of adding CSI and tabix index documentation alongside the BAI documentation. That should be a PR soon…

jmarshall commented 6 years ago

At present the page counts of our main specification documents are:

Format #pages
SAM 20
SAMtags 7
CRAM 26
VCF 36

As BGZF compression is used by other formats in addition to BAM, I've been thinking of pulling the descriptions of BGZF and indexing out of the SAM spec into a separate document. This would remove ~5 pages from SAMv1.pdf into a new ~8 page BGZF/indexing spec. Ideologically-nice to separate them, but it's not like SAMv1.pdf is one of our bigger documents to start with…

Thoughts? :+1: for a separate BGZF.pdf document, :-1: for leaving it where it is within SAMv1.pdf.

magicDGS commented 6 years ago

:+1 for separating. For me it will be clear that way, because the formats using bgzip are not SAM-specific. In addition, the BGZF.pdf document might be a good place to include tabix index instead of in it's own document.

Another option might be to have a "Indexing.pdf" document, for all indexes (.bai, .crai, .csi, tabix...).

jkbonfield commented 6 years ago

I'd agree for making it a separate spec too. It always felt like an odd bolt-on.

That reminds me - the CRAM spec wants a major revamp. What we have right now is the scary child of Microsoft, EBI and an xslt config which converted the XML embedded in the original EBI cram.docx file to LaTeX. Horrific! :-)

jmarshall commented 6 years ago

@magicDGS: Yes, my thought process moved in that same direction — improve CSI.pdf, tabix.pdf so they have actual descriptive text ⇒ put that CSI, tabix descriptive text alongside the BAI text they mostly duplicate & nuke the CSI*.pdf, tabix.pdf micro-documents ⇒ split out BAI/CSI/tabix indexing into a single separate document ⇒ move BGZF into that separate document too.

(.crai remains CRAM-specific.)

Thanks both for using the vote button on the previous comment 😛

jmarshall commented 6 years ago

An interesting issue arises here; the SAM spec states that the BAI is 6 levels deep, but the code in htslib and a comment \<here> indicates that internally it is considered 5 levels deep (and the default depth for CSI is 5 - presumably to match BAI's tree).

As for levels and depth: there are indeed six different levels for the bins in a BAI index, and the htslib code tends to think about them as level0level5 — and the depth parameter in the functions in the ancient embryonic CSIv1.pdf document is zero-based accordingly (which is unhelpfully unnoted).

So there's no actual problem or inconsistency here, just a lack of clarity.

kortschak commented 4 years ago

The code quoted here:

/* calculate maximum bin number -- valid bin numbers range within [0,bin_limit) */
int bin_limit(int min_shift, int depth)
{
    return ((1 << (depth+1)*3) - 1) / 7;
}

gives 299593 when the SAM spec-specified depth of 6 is used, instead of SAM spec-specified value of 37449 which is given with a depth of 5.

This is beyond absence of clarity.

kortschak commented 4 years ago

Is there any chance that the substantive issues that are the OP will be addressed? This is nearly 5 years old now.