samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

GL field handling inconsistent between encoder and decoder #1371

Closed mitochon closed 2 years ago

mitochon commented 5 years ago

Verify

Somewhat related issue but not the same.

Subject of the issue

Getting an error reading GL fields written using htsjdk for multi-sample VCF.

Your environment

Steps to reproduce

When creating multi sample VCF, the encoder puts in ".,.,." for missing GL fields, e.g.

chr1    123 .   G   A   .   .   .   GT:GL:MIN_DP    0/1:0,0.9,0.1   ./.:.,.,.:10

Specifically there's this comment in the encoder code https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFEncoder.java#L346

// If we have a missing field but multiple values are expected, we need to construct a new string with all fields.
// For example, if Number=2, the string has to be ".,."

The VCFCodec does not honor this contract and I get an error when trying to parse the above line:

java.lang.NumberFormatException: For input string: "."
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at htsjdk.variant.variantcontext.GenotypeLikelihoods.parseDeprecatedGLString(GenotypeLikelihoods.java:264)
    at htsjdk.variant.variantcontext.GenotypeLikelihoods.fromGLField(GenotypeLikelihoods.java:90)
    at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:722)
    at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:132)
    at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:352)
    at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:284)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)

The GenotypeLikelihoods code is expecting a single "." regardless of the ploidy and the number of alleles.

Expected behaviour

VCFCodec should handle the case of multiple missing GL fields, especially since its the output generated by VCFEncoder.

Actual behaviour

See above comments

Additional information

This is the code snippet used to generate the entry above

import htsjdk.variant.vcf.*;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Test {

    public static void main(String[] args) {

        final VCFHeaderLine versionHeader = new VCFHeaderLine(VCFHeaderVersion.VCF4_2.getFormatString(), VCFHeaderVersion.VCF4_2.getVersionString());
        final VCFFormatHeaderLine gtHeader = new VCFFormatHeaderLine("GT", 1, VCFHeaderLineType.String, "Genotype");
        final VCFFormatHeaderLine glHeader = new VCFFormatHeaderLine("GL", VCFHeaderLineCount.G, VCFHeaderLineType.Float, "GL");
        final VCFFormatHeaderLine minDpHeader = new VCFFormatHeaderLine("MIN_DP", 1, VCFHeaderLineType.Integer, "Min DP");

        final Set<VCFHeaderLine> headerLines = new HashSet<>();
        headerLines.add(versionHeader);
        headerLines.add(gtHeader);
        headerLines.add(glHeader);
        headerLines.add(minDpHeader);

        final Genotype gt1 = new GenotypeBuilder("s1",
                Arrays.asList(Allele.create("G", true), Allele.create("A")))
                .attribute("GL", "0,0.9,0.1")
                .make();

        final Genotype gt2 = new GenotypeBuilder("s2",
                Arrays.asList(Allele.NO_CALL, Allele.NO_CALL))
                .attribute("MIN_DP", "10")
                .make();

        final GenotypesContext gtx = GenotypesContext.create(2);
        gtx.add(gt1);
        gtx.add(gt2);

        final VariantContext vc = new VariantContextBuilder(null, "chr1", 123, 123,
                Arrays.asList(Allele.create("G", true), Allele.create("A")))
                .genotypes(gtx)
                .make();

        final Set<String> samples = new HashSet<>();
        samples.add("s1");
        samples.add("s2");

        final VCFEncoder encoder = new VCFEncoder(new VCFHeader(headerLines, samples), false, false);
        System.out.println(encoder.encode(vc));
    }
}

Should output

chr1    123 .   G   A   .   .   .   GT:GL:MIN_DP    0/1:0,0.9,0.1   ./.:.,.,.:10
mitochon commented 5 years ago

VCFHeaderVersion.VCF4_2 See also the code snippet in the additional information section for more details.

lbergelson commented 5 years ago

This was fixed by #1372 and #1389

lbergelson commented 2 years ago

Closing since I believe this was fixed.