umccr / htsget-rs

A server implementation of the htsget protocol for bioinformatics in Rust
https://samtools.github.io/hts-specs/htsget.html
MIT License

Edit boundary blocks on the server side to drop non-requested data #238

Open brainstorm opened 7 months ago

brainstorm commented 7 months ago

Currently the htsget protocol assumes that some returned data blocks will contain reads that were not requested. This was an accepted trade-off in the spec, but for some privacy use cases those extra bytes should be dropped on the server side, rather than relying on the client to (presumably) ignore them. From @mmalenic's input on the topic:

Basically, this occurs because the htsget-rs server only calculates byte ranges based on the index file (e.g. .cram.crai). Since the index file does not contain every position in the file that can be sliced, the server returns the smallest byte ranges that include the request. However, this means that the response often includes additional bytes that aren't part of the request.

To address this, we could have a strict mode which would allow the htsget-rs server to inspect the file and only return the data that is required by the request. This would ensure that no additional data is present beyond the requested reference name, start and end regions.
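
As a rough illustration of what such a strict mode would have to do per record, here is a minimal sketch. `Record`, `Query`, `overlaps` and `strict_filter` are hypothetical names made up for this example and are not part of htsget-rs. It assumes records have already been decoded from the boundary blocks, and that coordinates follow the htsget convention of a 0-based start (inclusive) and end (exclusive).

```rust
/// Hypothetical, simplified alignment record (not an htsget-rs type).
#[derive(Debug, Clone, PartialEq)]
struct Record {
    reference_name: String,
    start: u64, // 0-based, inclusive
    end: u64,   // 0-based, exclusive
}

/// Hypothetical representation of the requested region.
#[derive(Debug, Clone, PartialEq)]
struct Query {
    reference_name: String,
    start: u64, // 0-based, inclusive
    end: u64,   // 0-based, exclusive
}

/// True when the record is on the requested reference and its
/// half-open interval intersects the requested half-open interval.
fn overlaps(record: &Record, query: &Query) -> bool {
    record.reference_name == query.reference_name
        && record.start < query.end
        && record.end > query.start
}

/// Keep only the records from a boundary block that overlap the request.
fn strict_filter(block: &[Record], query: &Query) -> Vec<Record> {
    block
        .iter()
        .filter(|record| overlaps(record, query))
        .cloned()
        .collect()
}
```

Only the first and last blocks of a byte range would presumably need this treatment, since the blocks in between lie fully inside the request. The surviving records would then have to be re-encoded into a valid block before being returned, which this sketch does not cover.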

mmalenic commented 5 months ago

Noting here that this doesn't only apply to Crypt4GH. It also applies to the bytes returned in BGZF files, which need to align to BGZF blocks.
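
To make the BGZF case concrete, here is a small sketch of why the returned range grows. `split_virtual_offset` and `align_to_blocks` are hypothetical helpers written for this comment, not htsget-rs functions. The sketch assumes the compressed start offsets of the BGZF blocks are known and sorted. It relies on the BGZF virtual offset layout, where the upper 48 bits are the compressed offset of the block and the lower 16 bits are the offset within the uncompressed block.

```rust
/// Split a BGZF virtual offset into
/// (compressed offset of the block, offset within the uncompressed block).
fn split_virtual_offset(virtual_offset: u64) -> (u64, u16) {
    (virtual_offset >> 16, (virtual_offset & 0xFFFF) as u16)
}

/// Expand the compressed byte range [start, end) so that it begins and
/// ends on BGZF block boundaries. `block_starts` holds the compressed
/// start offset of every block (an assumption of this sketch).
fn align_to_blocks(block_starts: &[u64], file_len: u64, start: u64, end: u64) -> (u64, u64) {
    // Last block boundary at or before the wanted start.
    let aligned_start = block_starts
        .iter()
        .copied()
        .filter(|&offset| offset <= start)
        .max()
        .unwrap_or(0);
    // First block boundary at or after the wanted end, else end of file.
    let aligned_end = block_starts
        .iter()
        .copied()
        .filter(|&offset| offset >= end)
        .min()
        .unwrap_or(file_len);
    (aligned_start, aligned_end)
}
```

For example, with blocks starting at compressed offsets 0, 100, 250 and 400, a request that only needs bytes 120 to 260 would be widened to 100 to 400, so everything else stored in those two blocks is returned as well.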