samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

Bring htsjdk's BCF2Codec up-to-date against the latest spec, and add tests #628

Open droazen opened 8 years ago

droazen commented 8 years ago

BCF2Codec has not been well maintained over the years and does not fully support the latest BCF 2.2 spec (see the BCF section in http://samtools.github.io/hts-specs/VCFv4.3.pdf). We now have at least one htsjdk client (Intel) that wants to use the htsjdk BCF codec for performance reasons to ingest htslib output (which does support BCF 2.2). Even if we didn't, it's worth bringing the codec up to date rather than continuing to distribute htsjdk with out-of-date BCF support.

droazen commented 8 years ago

For @cmnbroad

cmnbroad commented 8 years ago

Also, when this is finished we should undo https://github.com/samtools/htsjdk/pull/591.

akiezun commented 8 years ago

@droazen to be clear: we only need to be able to read BCF2.2 records created by htslib. I don't think we need to be able to write BCF2.2 for our use case. Is that right?

droazen commented 8 years ago

@akiezun I believe so, yes (though we should confirm with the TileDB guys). In any event, the BCF2Codec is only capable of reading, so writing is not covered by this ticket.

heuermh commented 8 years ago

For what it is worth, our use case is to read BCF2.2 records created by htslib with htsjdk through Hadoop-BAM. Thanks for looking into this!

chriswhelix commented 8 years ago

Is there any sense of when this work might be completed? We have a similar requirement.

droazen commented 8 years ago

We really hope to be able to assign an engineer to work on this during this quarter, but can't make any firm promises at this time. The work has been started (see https://github.com/samtools/htsjdk/pull/694 and https://github.com/cmnbroad/htsjdk/tree/cn_bcf2), but it has run into snags: we need to maintain backwards compatibility with older versions of the VCF/BCF specs, and the htsjdk parsing code is unfortunately not well decomposed by version. A significant refactoring is needed to properly isolate the parsers for different versions from each other (and to do the equivalent on the writing end).
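To make "isolating the parsers by version" concrete, here is a rough sketch of one way it could look. It is not how htsjdk is structured today. It assumes the caller has already BGZF-decompressed the start of the file, and relies on the BCF2 layout in which the stream begins with the bytes `BCF` followed by a major and a minor version byte. `BcfVersionDispatch`, `RecordParser`, and the parser labels are hypothetical names invented for illustration.

```java
final class BcfVersionDispatch {
    /** Hypothetical per-version record parser; not an htsjdk type. */
    interface RecordParser {
        String name();
    }

    /**
     * Reads the version from the first bytes of an already-decompressed BCF stream.
     * Returns {major, minor}, or null if the bytes do not start with the "BCF" magic.
     */
    static int[] readVersion(byte[] firstBytes) {
        if (firstBytes == null || firstBytes.length < 5) {
            return null;
        }
        if (firstBytes[0] != 'B' || firstBytes[1] != 'C' || firstBytes[2] != 'F') {
            return null;
        }
        return new int[] { firstBytes[3] & 0xFF, firstBytes[4] & 0xFF };
    }

    /** Picks an isolated parser per minor version instead of branching inside one parser. */
    static RecordParser parserFor(int[] version) {
        if (version == null || version[0] != 2) {
            throw new IllegalArgumentException("not a BCF2 stream");
        }
        switch (version[1]) {
            case 1:
                return () -> "bcf2.1"; // placeholder for a 2.1-only parser
            case 2:
                return () -> "bcf2.2"; // placeholder for a 2.2-only parser
            default:
                throw new IllegalArgumentException("unsupported BCF minor version: " + version[1]);
        }
    }
}
```

The point of the sketch is only that the version check happens once, up front, so each parser can assume a single spec version.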

chriswhelix commented 8 years ago

@droazen thanks for the quick response! Is that branch functional for BCF2.2 support if we don't need compatibility with earlier formats?

droazen commented 8 years ago

@chriswhelix That branch is a work in progress that definitely shouldn't be used for anything except testing purposes -- @cmnbroad can provide more details on its current status.

cmnbroad commented 8 years ago

It's been a while since I've looked at it, but my recollection is that support for reading was mostly there, with the exception of one remaining BCF2.2 feature (end-of-vector marker?). There is no write support at all. Anyway, it's not finished; it's pretty far behind master, and it's certainly not tested.
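For context on the end-of-vector marker, a minimal sketch of what handling it involves, assuming the 8-bit integer case from the BCF 2.2 section of the spec, where `0x80` means "missing" and `0x81` means "end of vector" (shorter vectors are padded with it up to the fixed per-sample width). `Bcf22Int8Vector` is a hypothetical name, and this is not code from the branch.

```java
final class Bcf22Int8Vector {
    static final byte MISSING = (byte) 0x80;       // int8 "missing" sentinel
    static final byte END_OF_VECTOR = (byte) 0x81; // int8 "end of vector" padding

    /**
     * Decodes a fixed-width int8 vector: values after the first end-of-vector
     * marker are padding and are dropped; missing values become null.
     */
    static Integer[] decode(byte[] raw) {
        int length = 0;
        while (length < raw.length && raw[length] != END_OF_VECTOR) {
            length++;
        }
        Integer[] values = new Integer[length];
        for (int i = 0; i < length; i++) {
            values[i] = (raw[i] == MISSING) ? null : Integer.valueOf(raw[i]);
        }
        return values;
    }
}
```

A reader that does not know about the marker would instead report the padding bytes as real values, which is why this matters for ingesting htslib output.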

chriswhelix commented 8 years ago

Thanks @cmnbroad. Really appreciate the responsiveness on this.

After an only mildly hellish tour through JNAerator, Bridj, and undocumented C code, I managed to get bindings to htslib working as a short term solution. Would definitely prefer to use htsjdk once it's updated.

agostof commented 3 years ago

Was additional development done to support BCF2.2?

cmnbroad commented 3 years ago

@agostof BCF2.2 is still not supported.