Open h-2 opened 3 years ago
This is a side effect of the way the dictionary is coded in HTSlib, which was the first implementation of BCF. The IDX feature that allows to drop tags from the header was added only later and the specification reflects this non-ideal situation. I agree it's awkward, but the behavior is intended.
Thank you for your reply!
The IDX feature that allows to drop tags from the header was added only later and the specification reflects this non-ideal situation. I agree it's awkward, but the behavior is intended.
Yes, this looks a lot like "we have implemented this hack to do X" and "oh, the format is now incompatible to before, let's just standardize the hack"...
Anyway, just to have this clear:
Since this is not clearly defined in the spec, do you think that these two points are required to write files that are compatible with htslib and bcftools or is this merely the "flavor" of implementation used there?
Not exactly a naive hack like that. The code that later became HTSlib was developed first and unofficially more or less defined what BCF format looks like. In the original implementation it was not possible to remove header lines without recoding the whole BCF body and adding the IDX tag this way was the least problematic way to allow it, otherwise more significant and breaking changes in core structures of HTSlib and the format itself would have to be made, and HTSlib was already out there in public at that point.
I should emphasize that this less hacky than it seems. The VCF specification clearly states that there are two dictionaries, one for contigs and one for strings. Therefore the first occurence of the tag X
across all FILTER, INFO, FORMAT columns defines its index, and it is independent of X
defined in contigs and vice versa. This results in two number sequences, as in this example:
##FILTER=<ID=PASS,Description="..",IDX=0>
##contig=<ID=X,IDX=0>
##contig=<ID=Y,IDX=1>
##FILTER=<ID=X,Number=1,Type=String,Description="..",IDX=1>
##FILTER=<ID=Y,Number=1,Type=String,Description="..",IDX=2>
##INFO=<ID=X,Number=1,Type=String,Description="..",IDX=1>
##INFO=<ID=Y,Number=1,Type=String,Description="..",IDX=2>
##FORMAT=<ID=X,Number=1,Type=String,Description="..",IDX=1>
##FORMAT=<ID=Y,Number=1,Type=String,Description="..",IDX=2>
I think we should migrate this to the hts-specs repository (@pd3 - see Transfer Issue in github interface, or I can do it if you wish). The question is about the language in the VCF specification as much as it is about one specific implementation. It definitely needs improving. It doesn't even state it's numeric, let alone that it starts from 0 rather than 1 and counts up sequentially. There are no examples containing this, so it's not something a reader can determine short of using specific implementations of the specification. I understand that the software came first, but the spec should be accessable as the primary source to understanding the format.
I have questions though that need considering for when we're updating the specification.
Is this a bcftools/htslib-only thing, or does htsjdk also handle this in the same manner? If the former, it should probably be dropped from the specification entirely given extra program-specific header fields are permitted as it's currently underspecified. If it's used by other tools, then it'll need a major overhaul in the spec.
What is a "dictionary-defining record"? It's confusng because the paragraph above (see original quote) mentions ##dictionary
but you example doesn't have that. Is it meaning anything with an "ID" key (ie not ##fileDate
for example). This needs tightly defining.
The example describing IDX doesn't actually use IDX
! Why not? It just adds more confusion.
What does "Note that the dictionary encoding has the magic prefix 's' here to indicate that the field's value is actually in the dictionary entry giving by the subsequent offset" mean? I see s0, s1, etc in the record lines, which I assume it refers to, but I think it adds more confusion. If BCF itself is using binary 0, 1, 2 etc then why not just use that, or <0>
and some comment about
"Note that ``PASS'' is always implicitly encoded as the first entry in the header dictionary". Does that mean PASS should always be IDX=0 and our IDX fields should therefore start from 1 to avoid clashes? That's not what I see in your example above however so I assume no, but I don't understand how these internal dictionary values and the s0, s1 etc values correspond to the explicit IDX values. I get the idea, that if we wanted to reheader a BCF by inserting or removing entire lines then the only way of avoiding rewriting every BCF record is to provide a translation lookup table in the header (hence IDX), but I don't get why PASS isn't in there given our INFO fields are starting from 1 and not 0. (Edit: sorry I missed it - I see that it is there. My mistake)
(From above) "The VCF specification clearly states that there are two dictionaries, one for contigs and one for strings". I'd argue that's not clear at all! I've reread the bit of the specification explaining IDX and see no such explicit explanation. The example does show contig as "0" and PASS as "s0", implying we have two lists, but this also doesn't seem to be the same two lists you're referring to above (I think).
Also, from your above example "This results in two number sequences, as in this example". I see the following separate number sequences: contig (0,1), FILTER (0,1,2), INFO (1,2) and FORMAT (1,2). That's not two number sequences, but four. Two of which mysteriously start counting from 1 and two from 0.
@jkbonfield
re 2. From the logic of the dictionary, a dtictionary-defining record for tag X
is the first record that defines the string X
, be it FILTER, INFO, or FORMAT.
re 1. The dictionary is implicit and unless one removes lines from the header, the IDX tags are not needed. I am also curious whether htsjdk can read BCFs with removed lines and if it allows the operation of line removal.
re 3. That's correct, the example of IDX tag should be added to the specification
re 4. I don't know anything about this one
re 5. That is correct. The VCF specification makes the definition of the filter PASS
optional, therefore it must be treated as implicitly present even if it does not have to explicitly appear in the header and is IDX=0
.
re 6. There are two sections about it, 6.2.1 and 6.2.2. Of course the language can be improved etc. The point is there is a logic behind having two numbering sequences.
re 7. See "re 2" and "re 6" above.
EDIT: struggling with the markdown renumbering the bullet points. Now it should be correct.
Ah wait - I think I understand now (having noticed the mystery of why IDX=2 in this section of the top post:
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality",IDX=10>
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth",IDX=2>
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality",IDX=11>
@pd3's example has duplicate numbering for FILTER / INFO / FORMAT fields because the ID=
strings are duplicated too. Ie as we have the same ID string used in FILTER and INFO then they get the same numeric value in the lookup table. So I assume the implementation is basically a hash table insertion with auto-increment if not found, using two hashes (one for contigs and one for everything else). If this is so, we should probably define it in exactly those terms so at least it's obvious to software developers what's going on.
Also, is that mandatory, or just implementation specific? If I had a BCF implementation with separate hash tables for INFO and FORMAT, would that actually be malformed when the key strings are duplicated? It's a subtle point and I can see how you could create an implementation without realising the mistake for a long time. I just looked at a local BCF file with 248 ID entries and not a single one clashed, so you'd get a long way without knowing you had a bug just waiting to happen the first time someone created a reference named "GT".
* re 5. That is correct. The VCF specification makes the definition of the filter `PASS` optional, therefore it must be treated as implicitly present even if it does not have to explicitly appear in the header and is `IDX=0`.
Practically speaking, what does this mean if we don't have a PASS filter? If it's not in the header and not used in the records, does it still get allocated IDX 0 and all subsequent values start from 1? Or is it something more generic such as any FILTER column can be used without a corresponding FILTER header tag, meaning we have to cope with scanning for filter keywords first and preallocate them in IDX?
Also the only thing mentioning that PASS doesn't need to be defined in the header (other than implicitly by numerous examples missing it out) is the BCF encoding bit. IMO this should be explicitly stated in the VCF part of the specification. I couldn't even find anything in the VCF bit that states the other values must be defined. It explains the format of the header line, but not whether it is required. You'd hope that's taken as read, but we know it's not the case as ##contig
for example is entirely optional (sadly).
Is it permitted to leave these lines out?
Even if PASS does not appear in the body, it still must be implicitly defined by the dictionary, otherwise on the first encounter of a filter one wouldn't be able to tell from the index which one we are looking at.
The VCF format allows optional definition of tags. The BCF format does not and cannot, PASS is the only exception.So I think it is alright to have the implicit index of the PASS filter mentioned in the BCF section only. I agree the language should be improved so this is stated not just as an example.
The laziness of the VCF format is another issue. I discourage the use of undefined tags because it is a problem for programs like BCFtools (or anything relying on HTSlib), but this is how VCF was originally formulated. Note that I am describing the format, not saying I like these quirks.
So I assume the implementation is basically a hash table insertion with auto-increment if not found, using two hashes (one for contigs and one for everything else). If this is so, we should probably define it in exactly those terms so at least it's obvious to software developers what's going on.
Yes, that's the conclusion I came to as well. It would be really helpful if that were stated explicitly.
Also, is that mandatory, or just implementation specific? If I had a BCF implementation with separate hash tables for INFO and FORMAT, would that actually be malformed when the key strings are duplicated? It's a subtle point and I can see how you could create an implementation without realising the mistake for a long time. I just looked at a local BCF file with 248 ID entries and not a single one clashed, so you'd get a long way without knowing you had a bug just waiting to happen the first time someone created a reference named "GT".
Yes, that is exactly what I meant above. Regardless of the answer, I have now redesigned my code to reflect the behaviour of htslib -- just to be safe.
Regarding "PASS", I am now starting the numbering for strings at 1 and always adding a PASS entry with IDX 0. I hope that's correct.
I am currently implementing a stand-alone BCF implementation and am having some trouble with the IDX fields. These are specified quite poorly if I may say so.
This is what the spec has to say:
The example seems to suggest the following if IDX is not present:
Now, I am really unsure what happens when IDX come into play.
If I take the example from the spec and run it through bcfools convert, I get the following header:
As you can see, there are two entries with IDX=0 and two entries with IDX=2.
Is this behaviour intended? Obviously the numbering is not strictly increasing, so what does "the numbering must match the implicit dictionary of tags" mean here?
Thank you for your help!