sdsc-ordes / modos-api

Python API to manage multi-omics digital objects
https://sdsc-ordes.github.io/modos-api
Apache License 2.0

feat: htsget client #78

Closed cmdoret closed 3 months ago

cmdoret commented 4 months ago

Context

The current htsget streaming implementation is suboptimal: it eagerly downloads the whole slice and stores it on disk. This is because the reference htsget client supports neither lazy consumption nor a Pythonic interface.

Proposed changes:

This PR implements a minimal htsget client and refactors the streaming code to use it instead of the GA4GH implementation.

It also simplifies the streaming logic and reorganizes modules to improve cohesion and reduce coupling.

Note

For more technical details on how the htsget server and client work, see the module docstring of modos.genomics.htsget.
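
As a rough illustration of the ticket-based flow an htsget client follows, here is a minimal sketch of lazy block consumption. It is not the code from this PR: `iter_ticket_blocks` is a hypothetical name, and the ticket layout (`htsget.urls`, each entry with a `url` and optional `headers`) is assumed from the htsget specification. It uses only the standard library.

```python
import base64
from typing import Iterator
from urllib.parse import unquote_to_bytes
from urllib.request import Request, urlopen


def iter_ticket_blocks(ticket: dict, chunk_size: int = 65536) -> Iterator[bytes]:
    """Lazily yield the bytes referenced by an htsget ticket, in order.

    Hypothetical helper for illustration; nothing is fetched until the
    generator is iterated.
    """
    for block in ticket["htsget"]["urls"]:
        url = block["url"]
        if url.startswith("data:"):
            # Inline block: data:[<mediatype>][;base64],<payload>
            meta, payload = url[len("data:"):].split(",", 1)
            raw = unquote_to_bytes(payload)
            yield base64.b64decode(raw) if meta.endswith(";base64") else raw
        else:
            # Remote block: stream it using the headers the server asked for.
            request = Request(url, headers=block.get("headers", {}))
            with urlopen(request) as response:
                while chunk := response.read(chunk_size):
                    yield chunk
```

Concatenating the yielded chunks in order gives the requested slice. Because the function is a generator, a caller can stop early or forward chunks to a sink without holding the whole slice in memory.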

Limitations

This PR makes it possible to consume the htsget stream lazily. However, parsing with pysam still requires dumping the whole slice to a temporary file, because pysam lacks BytesIO support (documented in https://github.com/sdsc-ordes/modos-api/pull/78/commits/ed2f8a1023fc9ab7ab3c323bfc2db252ce351cc9).

Should pysam or another library implement parsing of in-memory CRAM/BCF buffers, it would be trivial to adapt the code.
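
The temporary-file workaround described in this section could look roughly like the sketch below. This is an assumption about the general pattern, not the PR's actual code: `spooled_to_file` is a hypothetical helper that writes a lazy byte stream to disk and hands back a path for a path-based parser.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterable, Iterator


@contextmanager
def spooled_to_file(chunks: Iterable[bytes], suffix: str = "") -> Iterator[str]:
    """Write a lazy byte stream to a temporary file and yield its path.

    Hypothetical helper for illustration; the file is removed on exit.
    """
    # delete=False so the file survives being closed and can be reopened by path.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            for chunk in chunks:
                tmp.write(chunk)
        yield tmp.name
    finally:
        os.unlink(tmp.name)
```

Inside the `with spooled_to_file(stream, suffix=".bam") as path:` block, the path could then be passed to a parser such as `pysam.AlignmentFile(path)`. If an in-memory parser becomes available, only this helper would need to be swapped out.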