samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
652 stars 174 forks source link

BCF2 Format -- missing values in vectors #232

Open yfarjoun opened 7 years ago

yfarjoun commented 7 years ago

I've been investigating some NPE rabbit hole in htsjdk (https://github.com/samtools/htsjdk/issues/961) and have ventured to the BCF2 part of the spec and found some confusing language.

In particular in Section 6.3.3 the following statements appear. In "Vectors":

One (or more) of the values in the vector may be missing, but others in the vector are not

and (in "Vectors of mixed length"):

...this means that it’s illegal to encode a vector VCF field with missing values

AFAICT the implementation in htsjdk excises missing data from all vectors in BCF. This can be seen here: htsjdk.variant.variantcontext.writer.BCF2FieldEncoder.GenericInts#encodeValue for integers and here htsjdk.variant.variantcontext.writer.BCF2FieldEncoder.Float#encodeValue for floats.

My sense is that this is incorrect. That there needs to be a way to represent [1, ., 2] and [2.3,.,4.1]...but as far as I can tell the confusion in the spec has lead to an implementation (in java) that doesn't support that.

I'm happy with implicitly removing the missing data at the end of a vector. These can be put back in if needed if the length is known or left out if the length is UNBOUNDED, but not allowing (or quietly removing) missing data in the middle of a vector seems crazy!

I'm happy to update the language in one way or another...but first I'd like to hear folks' opinions on what is the right thing to do.

yfarjoun commented 7 years ago

As far as I can tell, the problem that the spec is trying to deal with is the encoding of variable length (i.e. UNBOUNDED) vectors in the genotypes. So you effectively have a vector (of size n_samples) of unbounded vectors, thus the inner vectors are of unknown length. However, it also seems that a solution for this has been thought of in the spec as the End-Of-Vector is there defined for both INT and FLOAT (I'm not sure how to encode a variable length vector of strings in the genotype field, since vectors of strings are supposed to be separated with a comma)

For an UNBOUNDED vector, am I correct in thinking that the right encoding should end with the "end-of-vector" reserved value? (the examples in the spec do not encode an unbounded vector)

pd3 commented 7 years ago

This is from historic reasons. The end of vector value was added to the specification later, in the original draft only a missing value was supported. However, the end-of-vector is necessary to allow vectors with missing values in BCF, such fields are common when merging VCFs. There was a strong opposition against this from GATK developers, because it complicates life greatly. They made the practical choice of treating such vectors as completely missing to avoid complications in downstream processing.

yfarjoun commented 7 years ago

This is an interesting historical perspective... @droazen @lbergelson @tfenne care to chime in ?

As I see it BCF as implemented in htjdk is almost unusable as it is currently implemented....

yfarjoun commented 7 years ago

For UNBOUNDED Strings in the genotype field, HTSJDK currently fills the shorter vectors with nulls so that all strings are of the same length.

pd3 commented 7 years ago

Yes, that's correct, string FORMAT fields should be null-padded.

yfarjoun commented 7 years ago

I see why the GATK team opposed the change, as it requires using "boxed" Integers in the decoder instead of arrays of primitive ints, or even worse, emulate the vcf parser and give a vector of strings that need to be parsed by the consumer.

Not sure where this issue is going... :-(