samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

HtsGetVCFReader implementation? #1555

Open brainstorm opened 3 years ago

brainstorm commented 3 years ago

Description of the issue:

While implementing support for htsget in IGV-desktop, we (@reisingerf and @brainstorm) noticed that there's no htsgetVariantsReader or any similar reader akin to htsgetbamreader.

In other words, we need to wrap the htsjdk variant readers for VCF/BCF as we did for BAM (work in progress on the IGV PR/branch): https://github.com/igvteam/igv/blob/67a0a8be361b92cfeabc1048c9be26f5a37c7ec6/src/main/java/org/broad/igv/sam/reader/HtsgetBAMReader.java

We suspect there is similar ongoing work towards this in https://github.com/samtools/htsjdk/pull/1551? Although there's no explicit mention of "htsget", it seems to suggest an abstraction away from regular files on local disk.

/cc @mlin @ohofmann @jrobinso @lindenb @lbergelson

jrobinso commented 3 years ago

Hey all. If the motivation for this is IGV, I think I could implement support in IGV in a fairly straightforward manner with a custom FeatureSource. The implementation would use a VCF codec from Tribble, but everything else would live in IGV.

For background: I created Tribble, the concept of "codecs" for parsing features, etc., with input from Mark DePristo (who named it after a Star Trek episode), so I know that code well. I could implement this in htsjdk, but the overhead for me to do so is large enough that it would take longer and be harder for me to squeeze in.

brainstorm commented 3 years ago

I love that TOS Star Trek episode :)

Jim, that's what @reisingerf and I suspected after a fair bit of head scratching going through IGV's code. Although I'd prefer this feature to be supported within htsjdk itself (design principles), I totally get and respect your analysis.

So if no htsjdk developer steps up for this, could you finish up this part within IGV via Tribble? We could then close the GA4GH htsget-supported-in-BOTH-IGV.js-and-IGV-desktop meta-ticket for good, I reckon :)

jrobinso commented 3 years ago

Sure, although I'm leaving for vacation tomorrow so it might be a few weeks. Do we have an issue for this in IGV?

I had not seen that Star Trek episode, but Mark imagined codecs proliferating like tribbles (if that's what they are called). "Picard" also has a ST connection.

brainstorm commented 3 years ago

No rush on this, I just opened the issue on IGV, enjoy the break Jim!

Bioinformatics is so full of tribbles and many troubles... most of them are not as cute as the ST furry animals though.

brainstorm commented 7 months ago

@lbergelson @lindenb I'm currently implementing support for htsget:// loading of (remote) assets in Hartwig's hmftools:

https://github.com/umccr/hmftools/compare/89de3d8a19276ed23d568ceeb6dc41de9fdbd675..17ade02e451be6100d5c69e4d8ec78d65a607215

I've got BAM reading through our htsget server working alright, but while looking for SamInputReader.of() equivalents for VCFs (VcfInputReader.of()??), I bumped into this old issue again, the one I opened while working with IGV (I also saw the progress made in https://github.com/samtools/htsjdk/pull/1473 & https://github.com/samtools/htsjdk/pull/1551).

Is there currently a suitable alternative/factory to read VCF/BCF assets through htsget in a similar way to SAM/BAM? samtools.SeekableStreamFactory seems to hold the most promise, but I can't see HTSGET defined as a case there (only FTP/HTTP/HTTPS?)... SamInputReader does support HTSGET natively.
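
To make the question concrete, here is a minimal sketch of the two protocol-level steps a VCF-over-htsget reader would need, assuming the GA4GH htsget conventions (a `/variants/<id>` ticket endpoint with `format`, `referenceName`, `start` and `end` query parameters, and inline `data:` blocks encoded as base64). The class name, method names and example host are hypothetical and this is not htsjdk API; fetching the ticket, parsing its JSON and streaming the remote blocks are left out.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of htsjdk: illustrates the htsget steps only.
public class HtsgetVariantsSketch {

    /**
     * Builds the ticket request URL for a region of a variants dataset.
     * Assumes htsget coordinates: start is 0-based inclusive, end is exclusive.
     */
    public static URI ticketUri(String baseUrl, String id, String referenceName,
                                long start, long end) {
        String query = "format=VCF"
                + "&referenceName=" + URLEncoder.encode(referenceName, StandardCharsets.UTF_8)
                + "&start=" + start
                + "&end=" + end;
        return URI.create(baseUrl + "/variants/" + id + "?" + query);
    }

    /**
     * Decodes one inline block of a ticket, i.e. a URL of the form
     * data:<media type>;base64,<payload>. Remote (http/https) blocks would
     * instead be fetched with the headers listed in the ticket.
     */
    public static byte[] decodeDataBlock(String dataUri) {
        String marker = ";base64,";
        int idx = dataUri.indexOf(marker);
        if (!dataUri.startsWith("data:") || idx < 0) {
            throw new IllegalArgumentException("Not a base64 data URI: " + dataUri);
        }
        return Base64.getDecoder().decode(dataUri.substring(idx + marker.length()));
    }

    public static void main(String[] args) {
        // Example host and dataset id are placeholders.
        System.out.println(ticketUri("https://htsget.example.org", "sample1", "chr1", 0, 1000));
    }
}
```

A reader would concatenate the decoded and fetched blocks in ticket order and hand the resulting stream to a VCF parser, which is the part I'm hoping htsjdk can already do for us.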

/cc @ohofmann @reisingerf @mmalenic