samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

VCF4.2 PEDIGREE header parsing broken #1400

Closed: brainstorm closed this issue 5 years ago

brainstorm commented 5 years ago

Related issues: https://github.com/samtools/htsjdk/pull/835

Description of the issue:

The 2.19.0-48-g13818ba-SNAPSHOT build introduced in the IGV HEAD (master) branch (ping @igvteam @jrobinso) contains a change in the VCF header parser that breaks parsing of VCF 4.2 PEDIGREE header lines, both for local files and for cloud file loading. Here is the backtrace:

htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Invalid VCFSimpleHeaderLine: key=PEDIGREE name=null, for input source: /Users/romanvg/tmp/test.vcf.gz
        at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:97) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:82) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:117) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:90) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at org.broad.igv.track.TribbleFeatureSource.getFeatureSource(TribbleFeatureSource.java:113) ~[main/:?]
        at org.broad.igv.track.TribbleFeatureSource.getFeatureSource(TribbleFeatureSource.java:69) ~[main/:?]
        at org.broad.igv.track.TrackLoader.loadVCF(TrackLoader.java:305) ~[main/:?]
        at org.broad.igv.track.TrackLoader.loadTribbleFile(TrackLoader.java:400) ~[main/:?]
        at org.broad.igv.track.TrackLoader.load(TrackLoader.java:215) [main/:?]
        at org.broad.igv.ui.IGV.load(IGV.java:1432) [main/:?]
        at org.broad.igv.ui.IGV.loadResources(IGV.java:1364) [main/:?]
        at org.broad.igv.ui.IGV$4.run(IGV.java:475) [main/:?]
        at org.broad.igv.util.LongRunningTask.call(LongRunningTask.java:72) [main/:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.IllegalArgumentException: Invalid VCFSimpleHeaderLine: key=PEDIGREE name=null
        at htsjdk.variant.vcf.VCFSimpleHeaderLine.initialize(VCFSimpleHeaderLine.java:104) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.variant.vcf.VCFSimpleHeaderLine.<init>(VCFSimpleHeaderLine.java:91) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.variant.vcf.VCFPedigreeHeaderLine.<init>(VCFPedigreeHeaderLine.java:10) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.variant.vcf.AbstractVCFCodec.getPedigreeHeaderLine(AbstractVCFCodec.java:286) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.variant.vcf.AbstractVCFCodec.parseHeaderFromLines(AbstractVCFCodec.java:206) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:111) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at org.broad.igv.feature.tribble.VCFWrapperCodec.readActualHeader(VCFWrapperCodec.java:93) ~[main/:?]
        at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:95) ~[htsjdk-2.19.0-48-g13818ba-SNAPSHOT.jar:2.19.0-48-g13818ba-SNAPSHOT]
        ... 16 more
ERROR [2019-07-11T16:40:38,876]  [IGV.java:1367] [pool-3-thread-4]  Error loading track

The offending VCF has the following bits in the header:

##fileformat=VCFv4.2
(...)
##PEDIGREE=<Derived=PRJ190376_SFRC01160-S1-19-1434,Original=PRJ190375_SFRC01160-B1>

The change that broke it for us was committed a month ago:

https://github.com/samtools/htsjdk/blame/d5ac8634fa7a74c0275842ff3826c819259d5d12/src/main/java/htsjdk/variant/vcf/VCFSimpleHeaderLine.java#L44

That change effectively mandates the ID field, as per the VCF 4.3 spec, but it ignores the ##fileformat=VCFv4.2 line in the process and therefore breaks VCF 4.2 backwards compatibility (this affects a sizeable number of VCFs stored at our facilities, @UMCCR).
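To make the failure mode concrete, here is a minimal, self-contained sketch that splits the PEDIGREE line quoted above into its key/value pairs. The class and method names (`PedigreeLineSketch`, `parseAttributes`) are hypothetical and are not part of htsjdk; the parsing is deliberately naive and assumes no quoted values containing commas. It only illustrates that this 4.2-style line carries no ID key, which is consistent with the `name=null` in the exception message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of htsjdk: splits the body of a structured
// VCF header line such as ##PEDIGREE=<A=x,B=y> into ordered key/value pairs.
class PedigreeLineSketch {

    static Map<String, String> parseAttributes(String headerLine) {
        int open = headerLine.indexOf('<');
        int close = headerLine.lastIndexOf('>');
        if (open < 0 || close <= open) {
            throw new IllegalArgumentException("Not a structured header line: " + headerLine);
        }
        Map<String, String> attributes = new LinkedHashMap<>();
        // Naive split: assumes values are unquoted and contain no commas.
        for (String pair : headerLine.substring(open + 1, close).split(",")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected key=value but found: " + pair);
            }
            attributes.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = parseAttributes(
                "##PEDIGREE=<Derived=PRJ190376_SFRC01160-S1-19-1434,Original=PRJ190375_SFRC01160-B1>");
        System.out.println(attrs.keySet());          // [Derived, Original]
        System.out.println(attrs.containsKey("ID")); // false
    }
}
```

Any parser that unconditionally requires an ID attribute will reject this line, regardless of the declared file format version.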

Your environment:

Steps to reproduce

$ git clone https://github.com/igvteam/igv.git && cd igv && ./gradlew run

Then load a VCF 4.2 file whose header contains a PEDIGREE line without an ID field, like the one shown above.

Expected behaviour

The VCF pedigree header reader should respect version 4.2, where the PEDIGREE line is only loosely specified, and should not mandate an ID field for both 4.3 and 4.2. It should check the version header first.
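A rough sketch of the version-gated check described here, under the assumption that ID only becomes mandatory from VCFv4.3 onward. The class and method names (`VersionAwarePedigreeCheck`, `requiresId`, `validate`) are hypothetical and do not reflect how htsjdk implements or fixed this; the version comparison is a simple major.minor parse of the `##fileformat` value.

```java
import java.util.Map;

// Hypothetical sketch, NOT htsjdk code: only enforce the PEDIGREE ID
// attribute when the declared file format is VCFv4.3 or later.
class VersionAwarePedigreeCheck {

    // Assumption: the value looks like "VCFv<major>.<minor>", e.g. "VCFv4.2".
    static boolean requiresId(String fileformat) {
        if (!fileformat.startsWith("VCFv")) {
            throw new IllegalArgumentException("Unrecognised fileformat: " + fileformat);
        }
        String[] parts = fileformat.substring(4).split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major > 4 || (major == 4 && minor >= 3);
    }

    static void validate(String fileformat, Map<String, String> pedigreeAttributes) {
        if (requiresId(fileformat) && !pedigreeAttributes.containsKey("ID")) {
            throw new IllegalArgumentException(
                    "PEDIGREE header line requires an ID attribute for " + fileformat);
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = Map.of(
                "Derived", "PRJ190376_SFRC01160-S1-19-1434",
                "Original", "PRJ190375_SFRC01160-B1");

        validate("VCFv4.2", attrs); // accepted: ID is not enforced before 4.3
        try {
            validate("VCFv4.3", attrs);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected under 4.3: " + e.getMessage());
        }
    }
}
```

The point of the design is that the version is consulted before any attribute is treated as mandatory, so existing 4.2 files keep loading while 4.3 files are still validated strictly.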

Actual behaviour

The backtrace shown above is raised and the VCF is not loaded in IGV.

/cc @ohofmann @vladsaveliev @reisingerf @andrewpatto

cmnbroad commented 5 years ago

Yes, thanks for reporting this. This change hasn't surfaced in a release yet, but we'll get a fix in for the upcoming release.

brainstorm commented 5 years ago

Fantastic! Thanks for such a fast fix! ;)