samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

Adding a PEDIGREE header only works if no PEDIGREE headers are present #1479

Closed bartcharbon closed 4 years ago

bartcharbon commented 4 years ago

Description of the issue:

Programmatically adding a PEDIGREE header to a VCFHeader will only work if no PEDIGREE headers are already present. This makes it impossible to add multiple PEDIGREE headers.

Your environment:

htsjdk 2.21.3, Java 11, Windows 10

Steps to reproduce

example input files: examples.zip

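    // Needs imports from htsjdk.variant.vcf, htsjdk.variant.variantcontext.writer and java.io.File.
    // `reader` is an already-open VCF reader (e.g. a VCFFileReader) on one of the example inputs.
    VCFHeader header = reader.getFileHeader();
    VariantContextWriterBuilder vcWriterBuilder = new VariantContextWriterBuilder().clearOptions()
        .setOutputFile(new File("path/to/output.vcf"));
    VariantContextWriter writer = vcWriterBuilder.build();
    // Add two structured (v4.3-style) PEDIGREE lines to the header that was read from the input.
    header.addMetaDataLine(new VCFPedigreeHeaderLine("<ID=test1,Original=GermlineID>", VCFHeaderVersion.VCF4_3));
    header.addMetaDataLine(new VCFPedigreeHeaderLine("<ID=test2,Original=GermlineID>", VCFHeaderVersion.VCF4_3));
    writer.writeHeader(header);
    writer.close();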

Expected behaviour

Both PEDIGREE lines are added to the VCF header in the output file.

Actual behaviour

cmnbroad commented 4 years ago

@bartcharbon Your test files are marked as v4.2, but they contain PEDIGREE header lines that are formatted as v4.3 PEDIGREE lines (meaning they are "structured", with an ID field). The v4.2 spec, unlike v4.3, doesn't define PEDIGREE header lines as structured lines. The v4.2 parser accepts those lines as generic header lines since they're otherwise legal, but they are not modeled internally as VCFPedigreeHeaderLine objects. Your test code is then attempting to add v4.3 VCFPedigreeHeaderLine objects to the v4.2 header. It's this mix of objects and versions, along with htsjdk's awkward VCFHeader modeling, that is causing the inconsistent behavior.

There is probably a code path here that should produce a better error message, but I'd recommend trying to make the files consistent with the spec/versions and see if that resolves the problem. Also note that htsjdk does not have write support for v4.3, only read, so it will not automatically upconvert a v4.2 file or header to v4.3. If you read in a 4.2 file, it remains a v4.2 file in memory and on write.
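As a rough illustration of the consistency check suggested above, here is a minimal sketch in plain Java that flags v4.3-style (ID-bearing) PEDIGREE lines in a header that declares an earlier version. It does not use htsjdk. The class `PedigreeVersionCheck` and its methods are hypothetical names made up for this example, and the comma split is a simplification that ignores quoted values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-check (not part of htsjdk) for version/PEDIGREE mismatches in VCF header text. */
public class PedigreeVersionCheck {

    // Versions treated as "before 4.3" in this sketch; an assumption, extend as needed.
    private static final List<String> PRE_4_3 = Arrays.asList("4.0", "4.1", "4.2");

    /** Returns the version from the ##fileformat line (e.g. "4.2"), or null if there is none. */
    static String declaredVersion(List<String> headerLines) {
        final String prefix = "##fileformat=VCFv";
        for (String line : headerLines) {
            if (line.startsWith(prefix)) {
                return line.substring(prefix.length()).trim();
            }
        }
        return null;
    }

    /** True if the line is a PEDIGREE line whose body has an ID key (the v4.3 structured form). */
    static boolean isStructuredPedigree(String line) {
        final String prefix = "##PEDIGREE=<";
        String trimmed = line.trim();
        if (!trimmed.startsWith(prefix) || !trimmed.endsWith(">")) {
            return false;
        }
        String body = trimmed.substring(prefix.length(), trimmed.length() - 1);
        // Naive split: does not handle commas inside quoted values.
        for (String field : body.split(",")) {
            if (field.trim().startsWith("ID=")) {
                return true;
            }
        }
        return false;
    }

    /** Returns the structured PEDIGREE lines found in a header that declares a pre-4.3 version. */
    static List<String> mismatchedPedigreeLines(List<String> headerLines) {
        List<String> mismatched = new ArrayList<>();
        String version = declaredVersion(headerLines);
        if (version == null || !PRE_4_3.contains(version)) {
            return mismatched;
        }
        for (String line : headerLines) {
            if (isStructuredPedigree(line)) {
                mismatched.add(line);
            }
        }
        return mismatched;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "##fileformat=VCFv4.2",
                "##PEDIGREE=<ID=test1,Original=GermlineID>");
        for (String line : mismatchedPedigreeLines(header)) {
            System.out.println("v4.3-style PEDIGREE line in pre-4.3 header: " + line);
        }
    }
}
```

Running a check like this on the header lines before handing a file to htsjdk would surface the v4.2/v4.3 mix early; it says nothing about how htsjdk itself models those lines.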

bartcharbon commented 4 years ago

Thank you for your quick response. I tested with 4.3 input and got "Writing VCF version VCF4_3 is not implemented".

My bad, I overlooked the fact that 4.3 writing is not (yet?) supported.