sdsc-ordes / modos-api

Python API to manage multi-omics digital objects
https://sdsc-ordes.github.io/modos-api
Apache License 2.0

region-based filtering not working in htsget #51

Closed by cmdoret 4 months ago

cmdoret commented 5 months ago

Region-based filtering does not seem to take effect when streaming through htsget: htsget-rs receives the region specification but seems to always return all records. Reported upstream: https://github.com/umccr/htsget-rs/issues/248

cmdoret commented 5 months ago

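This is expected behaviour: htsget-rs cannot slice within bgzf blocks, so it returns every block overlapping the requested region, including records in those blocks that fall outside it. Confirmed on a larger test file: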

# client side
In [22]: len([variant for variant in m2.stream_genomics('modo-demo/ex/htsget.cram', region='11')])
[E::cram_index_load] Could not retrieve index file for '/tmp/tmpwbgmqln7'
[E::easy_errno] Libcurl reported error 77 (Problem with the SSL CA cert (path? access rights?))
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/98c59049a2df285c76ffb1c6db8f8b96": Input/output error
Out[22]: 12898

In [23]: len([variant for variant in m2.stream_genomics('modo-demo/ex/htsget.cram', region='11:5099230-5099889')])
[E::cram_index_load] Could not retrieve index file for '/tmp/tmp2wm27_cm'
[E::easy_errno] Libcurl reported error 77 (Problem with the SSL CA cert (path? access rights?))
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/98c59049a2df285c76ffb1c6db8f8b96": Input/output error
Out[23]: 2898

Solution

Filter records on the client side so that only those overlapping the query region are returned.
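A minimal sketch of what such a client-side filter could look like. This is not the code that went into modos-api: `Record`, `parse_region` and `filter_region` are hypothetical names used for illustration. It assumes region strings of the form `contig` or `contig:start-end` with 1-based inclusive coordinates, and records that expose a contig name with 0-based half-open coordinates.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for a streamed alignment or variant record."""
    contig: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def parse_region(region: str) -> Tuple[str, int, Optional[int]]:
    """Parse 'contig' or 'contig:start-end' (1-based, inclusive).

    Returns (contig, start, end) as a 0-based half-open interval;
    end is None when the region has no upper bound.
    Assumes contig names do not themselves contain ':'.
    """
    contig, _, interval = region.partition(":")
    if not interval:
        return contig, 0, None
    start, _, end = interval.partition("-")
    return contig, int(start) - 1, int(end) if end else None


def filter_region(records: Iterable[Record], region: str) -> Iterator[Record]:
    """Yield only the records that overlap the query region."""
    contig, start, end = parse_region(region)
    for rec in records:
        if rec.contig != contig:
            continue  # different chromosome
        if rec.end <= start:
            continue  # record ends before the region starts
        if end is not None and rec.start >= end:
            continue  # record starts after the region ends
        yield rec
```

Wrapping the iterator returned by the streaming call in something like `filter_region(...)` would drop the extra records that come along with whole blocks, at the cost of still downloading them.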