Context
The current htsget streaming implementation is suboptimal: it eagerly downloads the whole slice and stores it on disk. This is because the reference htsget client supports neither lazy consumption nor a pythonic interface.
Proposed changes:
This PR implements a minimal htsget client and refactors the streaming code to use it instead of the GA4GH implementation.
It also simplifies the streaming logic and reorganizes modules to improve cohesion and reduce coupling:
introduces a modos.genomics module, offloading many functions from modos.helpers
removes file_utils
removes all imports of modos.api (the public API) from other modules
Note
For more technical details on how the htsget server and client work, see the module docstring of modos.genomics.htsget.
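For readers unfamiliar with the protocol, the following is a minimal sketch of lazy ticket consumption, not the code from this PR. An htsget server answers a query with a JSON "ticket" listing URLs (plain HTTPS or inline `data:` URIs) whose contents, concatenated in order, form the requested slice. The names `iter_htsget_blocks`, `make_data_uri` and `example_ticket` are hypothetical and only the standard library is used.

```python
import base64
from typing import Iterator
from urllib.request import Request, urlopen


def iter_htsget_blocks(ticket: dict, chunk_size: int = 65536) -> Iterator[bytes]:
    """Lazily yield the bytes referenced by an htsget ticket, in order.

    Hypothetical helper: nothing is fetched until the generator is iterated,
    and at most `chunk_size` bytes are held in memory per step.
    """
    for block in ticket["htsget"]["urls"]:
        # Each block may carry headers (e.g. Range, Authorization) to forward.
        request = Request(block["url"], headers=block.get("headers", {}))
        # urlopen handles both https: and data: URLs.
        with urlopen(request) as response:
            while chunk := response.read(chunk_size):
                yield chunk


def make_data_uri(payload: bytes) -> str:
    """Encode bytes as a base64 data: URI, as a server may do for inline blocks."""
    encoded = base64.b64encode(payload).decode("ascii")
    return "data:application/octet-stream;base64," + encoded


# A made-up ticket with two inline blocks, so the example runs offline.
example_ticket = {
    "htsget": {
        "format": "BAM",
        "urls": [
            {"url": make_data_uri(b"header-"), "class": "header"},
            {"url": make_data_uri(b"body"), "class": "body"},
        ],
    }
}

slice_bytes = b"".join(iter_htsget_blocks(example_ticket))
```

Because the function is a generator, a caller can stop early or pipe chunks elsewhere without materializing the whole slice.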
Limitations
This PR makes it possible to lazily consume the htsget stream. However, parsing with pysam still requires dumping the whole slice to a temporary file, because pysam lacks BytesIO support (documented in https://github.com/sdsc-ordes/modos-api/pull/78/commits/ed2f8a1023fc9ab7ab3c323bfc2db252ce351cc9).
Should pysam or another library implement parsing of in-memory CRAM/BCF buffers, it would be trivial to adapt the code.
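The temporary-file workaround can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `spooled_to_tempfile` is a hypothetical name, and the parsing step is replaced by a plain file read so the example runs without pysam.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterable, Iterator


@contextmanager
def spooled_to_tempfile(chunks: Iterable[bytes], suffix: str = "") -> Iterator[str]:
    """Write a lazy byte stream to a temporary file and yield its path.

    Hypothetical helper: the file is removed when the context exits,
    even if writing or parsing fails.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:  # closes the file so a path-based parser can open it
            for chunk in chunks:
                tmp.write(chunk)
        yield tmp.name
    finally:
        os.unlink(tmp.name)


# Stand-in for an htsget stream.
stream = iter([b"chunk-1", b"chunk-2"])
with spooled_to_tempfile(stream, suffix=".cram") as path:
    # With pysam installed, a path-based parser such as
    # pysam.AlignmentFile(path) would be opened here instead.
    with open(path, "rb") as handle:
        content = handle.read()
```

If a parser accepting in-memory buffers becomes available, only the body of the `with` block would need to change; the stream itself is already lazy.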