umccr / htsget-rs-cli

Rust htsget client

advice on deserializing cram from htsget #3

Open cmdoret opened 1 month ago

cmdoret commented 1 month ago

Hey!

I've stumbled upon this project as we've started using your excellent htsget-rs server implementation :)

Now that we're trying to lazily consume the CRAM/BCF stream on the client side (in Python, through a file-like interface), I think we're running into a limitation of pysam, which can only parse records from the filesystem (https://github.com/pysam-developers/pysam/issues/1297).
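
To make the "file-like interface" part concrete, here is a minimal sketch of how an iterator of byte chunks could be wrapped so that it can be read like a file. It uses only the standard library; the name IterStream is hypothetical, and the sketch says nothing about whether a given parser will accept such an object.

```python
import io
from typing import Iterable, Iterator


class IterStream(io.RawIOBase):
    """Read-only, non-seekable file-like view over an iterator of byte chunks."""

    def __init__(self, chunks: Iterable[bytes]):
        super().__init__()
        self._chunks: Iterator[bytes] = iter(chunks)
        self._leftover = b""

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Pull chunks only when the caller asks for more data (lazy consumption).
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # 0 signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


# Buffer the raw stream so that read(n) returns exactly n bytes when available.
stream = io.BufferedReader(IterStream([b"ab", b"", b"cde"]))
first_three = stream.read(3)
rest = stream.read()
```

The stream cannot seek, so anything that needs random access (an index lookup, for instance) would still need the data on disk.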

I was thinking of (trying to) write Python bindings for noodles-htsget to parse the stream lazily on the client side. As you've apparently been working on the problem, I would be interested to hear your thoughts on the matter.

The aim would essentially be a Python library that exposes a lazy iterator over the htsget stream. In spirit, I think this is very similar to what you started in this repo, perhaps with a simple interface like:

con = HtsgetConnection.from_url(
  'http://localhost:8080/reads/file?format=CRAM&referenceName=chr1&start=103&end=1320'
)
with con.open() as stream:
  for record in stream:
    print(record.start)
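
A rough sketch of how such an interface could be structured in pure Python, for discussion only. It assumes the server returns a standard htsget JSON ticket (a top-level "htsget" object with a "urls" list, each entry holding a "url" and optional "headers"). The record decoder is deliberately left as a pluggable callable named parse_records, which is hypothetical and stands in for whatever CRAM/BCF parser ends up being used. iter_ticket_chunks is likewise just an illustrative helper name.

```python
import contextlib
import json
import urllib.request
from typing import Callable, Iterable, Iterator


def iter_ticket_chunks(ticket: dict, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Lazily yield the bytes of every block listed in an htsget ticket, in order."""
    for block in ticket["htsget"]["urls"]:
        # Each block may carry its own headers (e.g. Range or Authorization).
        request = urllib.request.Request(block["url"], headers=block.get("headers", {}))
        with urllib.request.urlopen(request) as response:
            while chunk := response.read(chunk_size):
                yield chunk


class HtsgetConnection:
    """Hypothetical client: resolves the ticket, then streams records lazily."""

    def __init__(self, url: str, parse_records: Callable[[Iterable[bytes]], Iterator]):
        self.url = url
        # Placeholder for the real decoder; not part of any existing library.
        self.parse_records = parse_records

    @classmethod
    def from_url(cls, url: str, parse_records: Callable[[Iterable[bytes]], Iterator]):
        return cls(url, parse_records)

    @contextlib.contextmanager
    def open(self):
        # Step 1: fetch the JSON ticket describing where the data blocks live.
        with urllib.request.urlopen(self.url) as response:
            ticket = json.load(response)
        # Step 2: hand a lazy chunk generator to the record parser.
        chunks = iter_ticket_chunks(ticket)
        try:
            yield self.parse_records(chunks)
        finally:
            chunks.close()  # stop fetching if the caller leaves the block early
```

Nothing is downloaded beyond the ticket until the caller starts iterating, and leaving the with block early closes the pending request.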

brainstorm commented 1 month ago

Hi @cmdoret, great question!

I was planning to put together a Rust noodles CRAM+Crypt4GH iterator example for you, but then I realised that the Python side is perhaps more important to you. If that's the case, I'd look into the noodles-htsget crate and PyO3/maturin. Here is a starting point:

https://pyo3.github.io/pyo3/v0.20.0/getting_started.html

I don't have plans to put together and support those Python bindings myself, but do keep me posted, totally interested and happy to help if you get stuck!

Thanks again for poking into that repo; it reminded me that I should pick it back up and move it forward to completion :)

/cc @mmalenic

cmdoret commented 1 month ago

Hi @brainstorm, thanks for your answer!

I've started something at https://github.com/cmdoret/htslurp to try and make Python bindings. I got the server to stream bytes from Rust to Python, but I can't yet figure out how to get a Record iterator over that stream with the noodles API. If you have an idea, an example would be very welcome :))

Then my plan was to make a struct that wraps noodles' CRAM/BCF records and defines the Python bindings. My impression is that it's probably easier to implement everything in Rust and only expose the result to Python.

From Python, I guess it could look something like this (not yet implemented):

import htslurp
iterator = htslurp.stream('https://localhost/htsget/reads/file?format=CRAM')
for rec in iterator:
  type(rec) # -> htslurp.AlignmentRecord

where htslurp.AlignmentRecord would have roughly the same fields as noodles::cram::Record.
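
For illustration, a pure-Python mock of what the exposed record type and iterator might look like. None of this exists in htslurp. The field names are assumptions modelled on the mandatory SAM columns rather than copied from noodles, and the dicts passed in stand in for records decoded on the Rust side.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class AlignmentRecord:
    """Hypothetical Python-side record; fields mirror SAM columns, not noodles."""

    name: Optional[str]             # QNAME
    flags: int                      # FLAG
    reference_name: Optional[str]   # RNAME, None if unmapped
    start: Optional[int]            # POS (1-based), None if unmapped
    mapping_quality: Optional[int]  # MAPQ
    sequence: str                   # SEQ

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flags & 0x4)  # SAM flag bit 0x4: segment unmapped


class RecordIterator:
    """Lazy iterator that converts decoded records one at a time."""

    def __init__(self, raw_records: Iterable[dict]):
        self._raw: Iterator[dict] = iter(raw_records)

    def __iter__(self) -> "RecordIterator":
        return self

    def __next__(self) -> AlignmentRecord:
        # StopIteration from the underlying iterator ends the loop naturally.
        return AlignmentRecord(**next(self._raw))


# Stand-in for records coming out of the decoder.
raw = [
    {"name": "read1", "flags": 0, "reference_name": "chr1",
     "start": 103, "mapping_quality": 60, "sequence": "ACGT"},
    {"name": "read2", "flags": 4, "reference_name": None,
     "start": None, "mapping_quality": None, "sequence": "TTGA"},
]
records = list(RecordIterator(raw))
```

With PyO3, the same shape would presumably be a Rust struct exposed as a Python class that implements `__iter__` and `__next__`, so the Python side would look identical to the loop above.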

Does this approach make sense to you?