pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks

iterating on unnamed streams #1297

Open cmdoret opened 1 month ago

cmdoret commented 1 month ago

Hello pysam devs

I would like to know whether pysam is the right tool for what I'd like to do:

tl;dr: I'd like to get an iterator of AlignedSegment objects over an in-memory binary stream (no file on disk, not stdio).

I'm streaming CRAM/BCF from an htsget server and wrote a client to lazily consume the binary stream. The client exposes it as a buffered read-only file object (io.RawIOBase) so that we can do:

with con.open() as stream:
    for chunk in stream:
        do_stuff(chunk)  # chunk: bytes

Now I wanted to make this stream easy to work with from Python and thought I could just do pysam.AlignmentFile(stream), but that doesn't work because pysam requires the input to be a file on disk or to have a file descriptor.

I found this line in libcalignmentfile.pyx, which indeed states that this is not possible. Is this a missing feature, or is it out of scope for pysam (e.g. is there a fundamental reason why it is not possible)? Or maybe I'm just taking the wrong approach?
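
For reference, one workaround I can think of is to bridge the stream through an OS pipe so that it gets a real file descriptor. This is only a minimal sketch: `pipe_from_stream` is a hypothetical helper name (not part of pysam), and it assumes the source object supports `read(n)`.

```python
import io
import os
import shutil
import threading


def pipe_from_stream(stream, chunk_size=64 * 1024):
    """Return a binary file object backed by an OS pipe that yields the bytes of `stream`.

    Hypothetical helper, not part of pysam. A background thread copies
    `stream` into the write end of the pipe; the returned read end has a
    real file descriptor (see its fileno()).
    """
    read_fd, write_fd = os.pipe()

    def pump():
        try:
            # Closing the write end (on leaving the with-block) signals EOF to the reader.
            with os.fdopen(write_fd, "wb") as sink:
                shutil.copyfileobj(stream, sink, chunk_size)
        except BrokenPipeError:
            # The reader closed its end before consuming everything.
            pass

    threading.Thread(target=pump, daemon=True).start()
    return os.fdopen(read_fd, "rb")


if __name__ == "__main__":
    # Stand-in for the htsget stream: any object with read(n) will do.
    fake_stream = io.BytesIO(b"example payload")
    with pipe_from_stream(fake_stream) as f:
        print(f.fileno(), f.read())
```

If pysam accepts file objects that expose a file descriptor, as described above, then something like pysam.AlignmentFile(pipe_from_stream(stream)) might work for sequential iteration. I have not verified this, and since a pipe is not seekable, index-based random access would not be available.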

Any feedback or suggestions welcome :)