Open yfarjoun opened 7 years ago
As far as I can tell, the problem that the spec is trying to deal with is the encoding of variable length (i.e. UNBOUNDED) vectors in the genotypes. So you effectively have a vector (of size n_samples) of unbounded vectors, thus the inner vectors are of unknown length. However, it also seems that a solution for this has been thought of in the spec as the End-Of-Vector is there defined for both INT and FLOAT (I'm not sure how to encode a variable length vector of strings in the genotype field, since vectors of strings are supposed to be separated with a comma)
For an UNBOUNDED vector, am I correct in thinking that the right encoding should end with the "end-of-vector" reserved value? (the examples in the spec do not encode an unbounded vector)
This is from historic reasons. The end of vector value was added to the specification later, in the original draft only a missing value was supported. However, the end-of-vector is necessary to allow vectors with missing values in BCF, such fields are common when merging VCFs. There was a strong opposition against this from GATK developers, because it complicates life greatly. They made the practical choice of treating such vectors as completely missing to avoid complications in downstream processing.
This is an interesting historical perspective... @droazen @lbergelson @tfenne care to chime in ?
As I see it BCF as implemented in htjdk is almost unusable as it is currently implemented....
For UNBOUNDED Strings in the genotype field, HTSJDK currently fills the shorter vectors with nulls so that all strings are of the same length.
Yes, that's correct, string FORMAT fields should be null-padded.
I see why the GATK team opposed the change, as it requires using "boxed" Integers in the decoder instead of arrays of primitive ints, or even worse, emulate the vcf parser and give a vector of strings that need to be parsed by the consumer.
Not sure where this issue is going... :-(
I've been investigating some NPE rabbit hole in htsjdk (https://github.com/samtools/htsjdk/issues/961) and have ventured to the BCF2 part of the spec and found some confusing language.
In particular in Section 6.3.3 the following statements appear. In "Vectors":
and (in "Vectors of mixed length"):
AFAICT the implementation in htsjdk excises missing data from all vectors in BCF. This can be seen here: htsjdk.variant.variantcontext.writer.BCF2FieldEncoder.GenericInts#encodeValue for integers and here htsjdk.variant.variantcontext.writer.BCF2FieldEncoder.Float#encodeValue for floats.
My sense is that this is incorrect. That there needs to be a way to represent [1, ., 2] and [2.3,.,4.1]...but as far as I can tell the confusion in the spec has lead to an implementation (in java) that doesn't support that.
I'm happy with implicitly removing the missing data at the end of a vector. These can be put back in if needed if the length is known or left out if the length is UNBOUNDED, but not allowing (or quietly removing) missing data in the middle of a vector seems crazy!
I'm happy to update the language in one way or another...but first I'd like to hear folks' opinions on what is the right thing to do.