Closed by cmdoret 4 months ago
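This is expected behaviour: htsget-rs cannot slice within BGZF blocks, so the response can include records that lie outside the query region but share a block with records inside it. Confirmed on a larger test file: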
# client side
In [22]: len([variant for variant in m2.stream_genomics('modo-demo/ex/htsget.cram', region='11')])
[E::cram_index_load] Could not retrieve index file for '/tmp/tmpwbgmqln7'
[E::easy_errno] Libcurl reported error 77 (Problem with the SSL CA cert (path? access rights?))
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/98c59049a2df285c76ffb1c6db8f8b96": Input/output error
Out[22]: 12898
In [23]: len([variant for variant in m2.stream_genomics('modo-demo/ex/htsget.cram', region='11:5099230-5099889')])
[E::cram_index_load] Could not retrieve index file for '/tmp/tmp2wm27_cm'
[E::easy_errno] Libcurl reported error 77 (Problem with the SSL CA cert (path? access rights?))
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/98c59049a2df285c76ffb1c6db8f8b96": Input/output error
Out[23]: 2898
Solution
Filter records on the client side so that only those matching the query region are returned.
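A minimal sketch of what that client-side filter could look like. This is not the project's actual implementation: `Record`, `parse_region` and `filter_region` are hypothetical names used for illustration. `Record` only mirrors the `reference_name`, `reference_start` and `reference_end` attributes that pysam alignment records expose, and the region string is assumed to be samtools-style (1-based, inclusive).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for an alignment record (attribute names follow pysam)."""
    reference_name: str
    reference_start: int           # 0-based, inclusive
    reference_end: Optional[int]   # 0-based, exclusive; None if unmapped


def parse_region(region: str) -> Tuple[str, int, Optional[int]]:
    """Parse 'contig' or 'contig:start-end' (assumed 1-based, inclusive)
    into a 0-based, half-open (contig, start, end) tuple."""
    contig, sep, span = region.partition(":")
    if not sep:
        # No coordinates given: the whole contig is requested.
        return contig, 0, None
    start, end = span.replace(",", "").split("-")
    return contig, int(start) - 1, int(end)


def filter_region(records: Iterable[Record], region: str) -> Iterator[Record]:
    """Yield only the records that overlap the requested region."""
    contig, start, end = parse_region(region)
    for rec in records:
        # Skip records on other contigs and unmapped records.
        if rec.reference_name != contig or rec.reference_end is None:
            continue
        # Half-open interval overlap test.
        if rec.reference_end > start and (end is None or rec.reference_start < end):
            yield rec


# Small synthetic example; these coordinates are made up, not taken from the test file.
records = [
    Record("11", 5099000, 5099100),  # ends before the region
    Record("11", 5099200, 5099300),  # overlaps the region start
    Record("11", 5099888, 5099990),  # overlaps the last base of the region
    Record("11", 5099889, 5099999),  # starts just after the region
    Record("12", 5099300, 5099400),  # different contig
]
kept = list(filter_region(records, "11:5099230-5099889"))
```

With real data, the same generator could wrap the iterator returned by the streaming call, provided the yielded objects expose these three attributes. Contig names that themselves contain a colon are not handled by this sketch.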
htsget streaming seems to return the whole region. htsget-rs receives the region specification but seems to always return all records. Reported upstream: https://github.com/umccr/htsget-rs/issues/248