Open raduvanciu opened 9 years ago
Hi raduvanciu ,
I do not see in the vcf-spec any discussion of missing values in a integer array. Is there something I missed, or are you proposing a change to the spec? Perhaps bcftools should emit the value you consider as "reasonable" instead?
Cheers,
Yossi.
On Thu, Oct 1, 2015 at 7:23 PM, raduvanciu notifications@github.com wrote:
When dealing with multiple samples, bcftools merge fills in the genotype PL field with '.' constants to account for samples where the variant is not present.
When the method decodeInts parses a PL field that has dots it return null.
For example, decodeInts("146,.,.,0,.,40") returns null Instead, it should return: [146,2147483647,2147483647,0,2147483647,40] or some other value instead of Integer.MAX_VALUE such that the min(PL) remains unchanged.
The following version of decodeInts fixes the problem;
private static final int[] decodeInts(final String string) { List
split = ParsingUtils.split(string, ','); int [] values = new int[split.size()]; for (int i = 0; i < values.length; i++) { try { values[i] = Integer.parseInt(split.get(i)); } catch (final NumberFormatException e) { values[i] = Integer.MAX_VALUE; } } return values; } — Reply to this email directly or view it on GitHub https://github.com/samtools/htsjdk/issues/340.
@yfarjoun: §6.3.3: For each integer size, the value with all bits set[sic] (0x80, 0x8000, 0x80000000) for 8, 16, and 32 bit values, respectively) indicates that the field is a missing value. See also the imminent VCF v4.3 spec for additions in this area.
hmmm. given that section 6 is for BCF encoding, I'm not sure it applies in general. There should be a vcf comment about this, I think.
On Fri, Oct 2, 2015 at 5:36 AM, John Marshall notifications@github.com wrote:
@yfarjoun https://github.com/yfarjoun: §6.3.3: For each integer size, the value with all bits set[sic https://github.com/samtools/hts-specs/issues/102](0x80, 0x8000, 0x80000000) for 8, 16, and 32 bit values, respectively) indicates that the field is a missing value. See also the imminent VCF v4.3 spec for additions in this area.
— Reply to this email directly or view it on GitHub https://github.com/samtools/htsjdk/issues/340#issuecomment-144975328.
In general, things that are expressible in one of VCF/BCF are also expressible in the other; if not, that's a bug.
§1.4.1 (Fixed fields) surely suffices: In all cases, missing values are specified with a dot (‘.’). You may wish to raise an issue pointing out that it should be clarified that this applies to genotype fields.
I am aware of the missing field, but here the request is that we interpret a missing part of a non-missing field as something else.
Even in the BCF case it is implementation dependent (depending on the number of bytes given to the field) but in vcf it is unspecified (and indeed not mentioned) therefore, I surmise that an array of values in vcf that has missing values in it, is not a missing value, and therefore all its values should be valid (just like in BCF, all the values are valid). A "." is not a valid value in a integer array...
Y.
On Fri, Oct 2, 2015 at 6:51 AM, John Marshall notifications@github.com wrote:
In general, things that are expressible in one of VCF/BCF are also expressible in the other; if not, that's a bug.
§1.4.1 (Fixed fields) surely suffices: In all cases, missing values are specified with a dot (‘.’). You may wish to raise an issue pointing out that it should be clarified that this applies to genotype fields.
— Reply to this email directly or view it on GitHub https://github.com/samtools/htsjdk/issues/340#issuecomment-144989173.
The text I quoted says missing values, not missing field. Please also contemplate the HQ
arrays at position 14370 in the example in §1.1 of the spec.
Anyway, the point is that this is what bcftools merge
does, and it believes it is emitting valid VCF.
So either htsjdk needs to deal with it, or you need to convince the bcftools maintainers that they're wrong — if the latter, please explain how the genotype PL field ought to be written to account for samples where the variant is not present.
(And either way, apparently the spec needs clarifying. But that's not news :smile:)
John,
position 14370 has both values missing, which is equivalent to having the whole field missing, I do not think that this is a good example for the matter at hand.
The text that you quoted was for fixed fields, in which case value and field is the same thing..This is exactly the issue here, so I do not think that it clarifies things. I agree that PL field for the mixed genotypes is hard to fill, but that is exactly why it should be dealt with in the program, and not in the spec. Taking the maximum value might work for PL, but not for other things (like AD)...so I do not think that a missing value inside the genotype fields should be automatically interpreted as the maximal value.
I do think that this might require clarification in HTS-spec or VCF-spec because (as you know) unlike double, int do not have a NaN value...setting the parse to replace any missing value with the maximal value without further changes to the parser will lead to errors.
Y.
On Fri, Oct 2, 2015 at 7:24 AM, John Marshall notifications@github.com wrote:
The text I quoted says missing values, not missing field. Please also contemplate the HQ array at position 14370 in the example in §1.1 of the spec.
Anyway, the point is that this is what bcftools merge does, and it believes it is emitting valid VCF.
So either htsjdk needs to deal with it, or you need to convince the bcftools maintainers that they're wrong — if the latter, please explain how the genotype PL field ought to be written to account for samples where the variant is not present.
(And either way, apparently the spec needs clarifying. But that's not news [image: :smile:])
— Reply to this email directly or view it on GitHub https://github.com/samtools/htsjdk/issues/340#issuecomment-144994079.
John and Yossi,
Thank you for looking into this.
You raised some interesting points:
-1
might work.decodeInts(".")
should return null in the patch I proposed. I am still investigating where this change propagates and maybe use a better default for AD. At the very least, the code would need a similar solution for GL array (using NAN as default). I would really appreciate if you can help me there.
I am considering other alternatives as well:
a) using Integer[]
instead of int[]
with null
for null values. Unfortunately, this may slow things down a bit, and the change would propagate all over.
b) add a field in GenotypeBuilder
that preserves original positions of PL values as a bit array (maybe encoded as a int). That would be useful for missing values in GL
as well. The field would need to be used when the PL is decoded.
What do you think?
Radu
P.S.: Here is an updated patch (possibly incomplete).
public LazyGenotypesContext.LazyData createGenotypeMap(final String str,
final List<Allele> alleles,
final String chr,
final int pos) {
...
} else if (gtKey.equals(VCFConstants.GENOTYPE_ALLELE_DEPTHS)) {
gb.AD(decodeInts(genotypeValues.get(i),-1));
} else if (gtKey.equals(VCFConstants.GENOTYPE_PL_KEY)) {
gb.PL(decodeInts(genotypeValues.get(i),Integer.MAX_VALUE));
...
}
private static final int[] decodeInts(final String string, final int nullValue) {
if (VCFConstants.MISSING_VALUE_v4.equals(string))
return null;
List<String> split = ParsingUtils.split(string, ',');
int [] values = new int[split.size()];
for (int i = 0; i < values.length; i++) {
try {
values[i] = Integer.parseInt(split.get(i));
} catch (final NumberFormatException e) {
values[i] = nullValue;
}
}
return values;
}
It's not a great example of the question at hand, but nonetheless I'm sure you noticed that the spec has written it as 1/1:43:5:.,.
rather than 1/1:43:5:.
The text I quoted was for fixed fields including the INFO field, where certainly value and field are not necessarily the same thing. You should consider that what applies to INFO applies in general to genotype fields; and again, you may wish to raise an issue to clarify the spec here so that the next reader fares better.
Certainly one should not just use the maximum value for these missing items, as other fields are not necessarily summarisable with min
as you've both noticed. This is why the spec defines a concept of missing values, so that programs can treat them as really missing and implement min(PL)
etc to skip such items.
And missing values are available within vectors too. This is spelt out in the spec in §6.3.3 (Type encoding) for BCF files: see the Vectors subsection and note the (imperfectly-worded) paragraph:
One (or more) of the values in the vector may be missing, but others in the vector are not. Here each value should be represented in the vector, and each corresponding BCF2 vector value either set to its present value or the type equivalent MISSING value.
Clearly there has to be some way to represent this in VCF too, and whether or not the imperfect spec spells it out clearly [1], that way is to write the vector item as .
, just like other missing things in VCF.
To be sure, int
does not have a NaN value, so we have to steal one. As noted, BCF steals (for integers) 0x80
/0x8000
/0x80000000
to represent missing, and unsurprisingly htslib represents missing integers in vectors in memory by using the same special values and treats them as effectively NaNs.
The bug here is that HTSJDK isn't doing something similar, either by Radu's (a) or (b) or by using the same sentinel values as BCF and htslib.
What do you think?
I think (a) use Integer[] instead of int[] with null for null values, or the hacky equivalent of using 0x80/etc instead of null to avoid the slowdowns.
[1] Clearly @yfarjoun would like to volunteer to improve this :smile:
When dealing with multiple samples,
bcftools merge
fills in the genotype PL field with '.' constants to account for samples where the variant is not present.When the method
decodeInts
parses a PL field that has dots it return null.For example,
decodeInts("146,.,.,0,.,40")
returnsnull
Instead, it should return:[146,2147483647,2147483647,0,2147483647,40]
or some other value instead of Integer.MAX_VALUE such that the min(PL) remains unchanged.The following version of
decodeInts fixes
the problem;