samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
649 stars 174 forks source link

IDX fields in BCF2.2's VCF header #591

Open h-2 opened 3 years ago

h-2 commented 3 years ago

I am currently implementing a stand-alone BCF implementation and am having some trouble with the IDX fields. These are specified quite poorly if I may say so.

This is what the spec has to say:

In order to avoid this costly operation, a new IDX field can be used to explicitly define the position
which is dropped on BCF-to-VCF conversion. If not present, the implicit relative position is assumed.
If the IDX field is present in one record, it must be present also in all other dictionary-defining records.
The IDX tag is not necessary in newly created BCF files, but if present, the numbering must match
the implicit dictionary of tags.

The example seems to suggest the following if IDX is not present:

Now, I am really unsure what happens when IDX come into play.

If I take the example from the spec and run it through bcfools convert, I get the following header:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x,IDX=0>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data",IDX=1>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth",IDX=2>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency",IDX=3>
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele",IDX=4>
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129",IDX=5>
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership",IDX=6>
##FILTER=<ID=q10,Description="Quality below 10",IDX=7>
##FILTER=<ID=s50,Description="Less than 50% of samples have data",IDX=8>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",IDX=9>
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality",IDX=10>
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth",IDX=2>
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality",IDX=11>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003

As you can see, there are two entries with IDX=0 and two entries with IDX=2.

Is this behaviour intended? Obviously the numbering is not strictly increasing, so what does "the numbering must match the implicit dictionary of tags" mean here?

Thank you for your help!

pd3 commented 3 years ago

This is a side effect of the way the dictionary is coded in HTSlib, which was the first implementation of BCF. The IDX feature that allows to drop tags from the header was added only later and the specification reflects this non-ideal situation. I agree it's awkward, but the behavior is intended.

h-2 commented 3 years ago

Thank you for your reply!

The IDX feature that allows to drop tags from the header was added only later and the specification reflects this non-ideal situation. I agree it's awkward, but the behavior is intended.

Yes, this looks a lot like "we have implemented this hack to do X" and "oh, the format is now incompatible to before, let's just standardize the hack"...

Anyway, just to have this clear:

  1. IDX for CONTIGs is "counted" separately from IDX for the other fields? (IDX for contigs isn't mentioned in the spec at all)
  2. IDX for the other fields is the same for identical ID strings?

Since this is not clearly defined in the spec, do you think that these two points are required to write files that are compatible with htslib and bcftools or is this merely the "flavor" of implementation used there?

pd3 commented 3 years ago

Not exactly a naive hack like that. The code that later became HTSlib was developed first and unofficially more or less defined what BCF format looks like. In the original implementation it was not possible to remove header lines without recoding the whole BCF body and adding the IDX tag this way was the least problematic way to allow it, otherwise more significant and breaking changes in core structures of HTSlib and the format itself would have to be made, and HTSlib was already out there in public at that point.

I should emphasize that this less hacky than it seems. The VCF specification clearly states that there are two dictionaries, one for contigs and one for strings. Therefore the first occurence of the tag X across all FILTER, INFO, FORMAT columns defines its index, and it is independent of X defined in contigs and vice versa. This results in two number sequences, as in this example:

##FILTER=<ID=PASS,Description="..",IDX=0>
##contig=<ID=X,IDX=0>
##contig=<ID=Y,IDX=1>
##FILTER=<ID=X,Number=1,Type=String,Description="..",IDX=1>
##FILTER=<ID=Y,Number=1,Type=String,Description="..",IDX=2>
##INFO=<ID=X,Number=1,Type=String,Description="..",IDX=1>
##INFO=<ID=Y,Number=1,Type=String,Description="..",IDX=2>
##FORMAT=<ID=X,Number=1,Type=String,Description="..",IDX=1>
##FORMAT=<ID=Y,Number=1,Type=String,Description="..",IDX=2>
jkbonfield commented 3 years ago

I think we should migrate this to the hts-specs repository (@pd3 - see Transfer Issue in github interface, or I can do it if you wish). The question is about the language in the VCF specification as much as it is about one specific implementation. It definitely needs improving. It doesn't even state it's numeric, let alone that it starts from 0 rather than 1 and counts up sequentially. There are no examples containing this, so it's not something a reader can determine short of using specific implementations of the specification. I understand that the software came first, but the spec should be accessable as the primary source to understanding the format.

I have questions though that need considering for when we're updating the specification.

  1. Is this a bcftools/htslib-only thing, or does htsjdk also handle this in the same manner? If the former, it should probably be dropped from the specification entirely given extra program-specific header fields are permitted as it's currently underspecified. If it's used by other tools, then it'll need a major overhaul in the spec.

  2. What is a "dictionary-defining record"? It's confusng because the paragraph above (see original quote) mentions ##dictionary but you example doesn't have that. Is it meaning anything with an "ID" key (ie not ##fileDate for example). This needs tightly defining.

  3. The example describing IDX doesn't actually use IDX! Why not? It just adds more confusion.

  4. What does "Note that the dictionary encoding has the magic prefix 's' here to indicate that the field's value is actually in the dictionary entry giving by the subsequent offset" mean? I see s0, s1, etc in the record lines, which I assume it refers to, but I think it adds more confusion. If BCF itself is using binary 0, 1, 2 etc then why not just use that, or <0> and some comment about meaning the binary encoded ? Adding an "s" in front of it seems to imply it's stored in string form. It also seems arbitrary given the contig recoding was written as "0" and not "s0". Does BCF encode numbers differently for the fixed columns vs in the INFO/FORMAT columns?

  5. "Note that ``PASS'' is always implicitly encoded as the first entry in the header dictionary". Does that mean PASS should always be IDX=0 and our IDX fields should therefore start from 1 to avoid clashes? That's not what I see in your example above however so I assume no, but I don't understand how these internal dictionary values and the s0, s1 etc values correspond to the explicit IDX values. I get the idea, that if we wanted to reheader a BCF by inserting or removing entire lines then the only way of avoiding rewriting every BCF record is to provide a translation lookup table in the header (hence IDX), but I don't get why PASS isn't in there given our INFO fields are starting from 1 and not 0. (Edit: sorry I missed it - I see that it is there. My mistake)

  6. (From above) "The VCF specification clearly states that there are two dictionaries, one for contigs and one for strings". I'd argue that's not clear at all! I've reread the bit of the specification explaining IDX and see no such explicit explanation. The example does show contig as "0" and PASS as "s0", implying we have two lists, but this also doesn't seem to be the same two lists you're referring to above (I think).

  7. Also, from your above example "This results in two number sequences, as in this example". I see the following separate number sequences: contig (0,1), FILTER (0,1,2), INFO (1,2) and FORMAT (1,2). That's not two number sequences, but four. Two of which mysteriously start counting from 1 and two from 0.

pd3 commented 3 years ago

@jkbonfield

EDIT: struggling with the markdown renumbering the bullet points. Now it should be correct.

jkbonfield commented 3 years ago

Ah wait - I think I understand now (having noticed the mystery of why IDX=2 in this section of the top post:

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality",IDX=10>
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth",IDX=2>
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality",IDX=11>

@pd3's example has duplicate numbering for FILTER / INFO / FORMAT fields because the ID= strings are duplicated too. Ie as we have the same ID string used in FILTER and INFO then they get the same numeric value in the lookup table. So I assume the implementation is basically a hash table insertion with auto-increment if not found, using two hashes (one for contigs and one for everything else). If this is so, we should probably define it in exactly those terms so at least it's obvious to software developers what's going on.

Also, is that mandatory, or just implementation specific? If I had a BCF implementation with separate hash tables for INFO and FORMAT, would that actually be malformed when the key strings are duplicated? It's a subtle point and I can see how you could create an implementation without realising the mistake for a long time. I just looked at a local BCF file with 248 ID entries and not a single one clashed, so you'd get a long way without knowing you had a bug just waiting to happen the first time someone created a reference named "GT".

jkbonfield commented 3 years ago
* re 5. That is correct. The VCF specification makes the definition of the filter `PASS` optional, therefore it must be treated as implicitly present even if it does not have to explicitly appear in the header and is `IDX=0`.

Practically speaking, what does this mean if we don't have a PASS filter? If it's not in the header and not used in the records, does it still get allocated IDX 0 and all subsequent values start from 1? Or is it something more generic such as any FILTER column can be used without a corresponding FILTER header tag, meaning we have to cope with scanning for filter keywords first and preallocate them in IDX?

Also the only thing mentioning that PASS doesn't need to be defined in the header (other than implicitly by numerous examples missing it out) is the BCF encoding bit. IMO this should be explicitly stated in the VCF part of the specification. I couldn't even find anything in the VCF bit that states the other values must be defined. It explains the format of the header line, but not whether it is required. You'd hope that's taken as read, but we know it's not the case as ##contig for example is entirely optional (sadly).

Is it permitted to leave these lines out?

pd3 commented 3 years ago

Even if PASS does not appear in the body, it still must be implicitly defined by the dictionary, otherwise on the first encounter of a filter one wouldn't be able to tell from the index which one we are looking at.

The VCF format allows optional definition of tags. The BCF format does not and cannot, PASS is the only exception.So I think it is alright to have the implicit index of the PASS filter mentioned in the BCF section only. I agree the language should be improved so this is stated not just as an example.

The laziness of the VCF format is another issue. I discourage the use of undefined tags because it is a problem for programs like BCFtools (or anything relying on HTSlib), but this is how VCF was originally formulated. Note that I am describing the format, not saying I like these quirks.

h-2 commented 3 years ago

So I assume the implementation is basically a hash table insertion with auto-increment if not found, using two hashes (one for contigs and one for everything else). If this is so, we should probably define it in exactly those terms so at least it's obvious to software developers what's going on.

Yes, that's the conclusion I came to as well. It would be really helpful if that were stated explicitly.

Also, is that mandatory, or just implementation specific? If I had a BCF implementation with separate hash tables for INFO and FORMAT, would that actually be malformed when the key strings are duplicated? It's a subtle point and I can see how you could create an implementation without realising the mistake for a long time. I just looked at a local BCF file with 248 ID entries and not a single one clashed, so you'd get a long way without knowing you had a bug just waiting to happen the first time someone created a reference named "GT".

Yes, that is exactly what I meant above. Regardless of the answer, I have now redesigned my code to reflect the behaviour of htslib -- just to be safe.

Regarding "PASS", I am now starting the numbering for strings at 1 and always adding a PASS entry with IDX 0. I hope that's correct.