umccr / htsget-rs

A server implementation of the htsget protocol for bioinformatics in Rust
https://samtools.github.io/hts-specs/htsget.html
MIT License
38 stars 9 forks source link

regions has no effect when streaming BCF #248

Open cmdoret opened 4 months ago

cmdoret commented 4 months ago

Hello and thanks for providing this excellent implementation of htsget! We are trying to use it together with a minio S3 storage and so far it worked well, however we noticed that when requesting a specific region, the /variants endpoint returned all records regardless. I believe this is a bug in htsget-rs based on the server logs, but I may also have mis-interpreted them or mis-used htsget-rs. Do you have any advice or suggestion on where the issue might be?

Environment:

Steps to reproduce:

  1. Download example file https://github.com/vcflib/vcflib/blob/master/samples/sample.vcf
  2. Process file into bcf:
    bgzip sample.vcf
    bcftools index sample.vcf.gz 
    bcftools convert -Ob -o abc.bcf sample.vcf.gz
    bcftools index abc.bcf
    bcftools index -s abc.bcf.csi
    19      .       2
    20      .       6
    X       .       1
  3. Upload bcf + csi index into s3 bucket
  4. Send a GET request to htsget for contig "19"
  5. Send a GET request to htsget for contig "20"

Observed behaviour: Both queries returned all variant records from all contigs. Expected behaviour: Only variants of the requested chromosome are returned.

Observations:

Log from the contig 19 request shows that the query was parsed properly, and that segments (10,16) were requested.

2024-05-29T11:42:02.765895Z  INFO HTTP request{http.method=GET http.route=/variants/{id:.+} http.flavor=1.0 http.scheme=http http.host=htsget:8080 http.client_ip=172.23.0.5
http.user_agent=python-requests/2.31.0 http.target=/variants/ex/abc?referenceName=19&start=1000&end=5000&format=BCF otel.name=HTTP GET /variants/{id:.+} otel.kind="server" request_id=a9d7330
0-be43-4965-b972-ebfbef74041c}:variants{request=Query({"end": "5000", "referenceName": "19", "format": "BCF", "start": "1000"}) path=Path("ex/abc") http_request=
HttpRequest HTTP/1.0 GET:/variants/ex/abc
  query: ?"referenceName=19&start=1000&end=5000&format=BCF"
  params: Path { path: Url { uri: /variants/ex/abc?referenceName=19&start=1000&end=5000&format=BCF, path: None }, skip: 16, segments: [("id", Segment(10, 16))] }
  headers:
    "host": "htsget:8080"
    "connection": "close"
    "accept": "*/*"
    "accept-encoding": "gzip, deflate, br"
    "user-agent": "python-requests/2.31.0"
}: htsget_actix::handlers::get: variants endpoint GET request request=Request { path: "ex/abc", query: {"end": "5000", "referenceName": "19", "format": "BCF", "start": "1000
"}, headers: {"host": "htsget:8080", "connection": "close", "accept": "*/*", "accept-encoding": "gzip, deflate, br", "user-agent": "python-requests/2.31.0"} }

Logs from the contig 20 query show that the same segments (10,16) were requested, although the query is different. I am not sure whether I interpret this properly.

2024-05-29T11:56:32.734937Z  INFO HTTP request{http.method=GET http.route=/variants/{id:.+} http.flavor=1.0 http.scheme=http http.host=htsget:8080 http.client_ip=172.23.0.5
http.user_agent=python-requests/2.31.0 http.target=/variants/ex/abc?referenceName=20&start=1000&end=5000&format=BCF otel.name=HTTP GET /variants/{id:.+} otel.kind="server" request_id=baa2d58
8-3098-4ff5-acc7-decb9490403b}:variants{request=Query({"referenceName": "20", "start": "1000", "format": "BCF", "end": "5000"}) path=Path("ex/abc") http_request=
HttpRequest HTTP/1.0 GET:/variants/ex/abc
  query: ?"referenceName=20&start=1000&end=5000&format=BCF"
   params: Path { path: Url { uri: /variants/ex/abc?referenceName=20&start=1000&end=5000&format=BCF, path: None }, skip: 16, segments: [("id", Segment(10, 16))] }
  headers:
    "host": "htsget:8080"
    "user-agent": "python-requests/2.31.0"
    "accept": "*/*"
    "accept-encoding": "gzip, deflate, br"
    "connection": "close"
}: htsget_actix::handlers::get: variants endpoint GET request request=Request { path: "ex/abc", query: {"referenceName": "20", "start": "1000", "format": "BCF", "end": "5000
"}, headers: {"host": "htsget:8080", "user-agent": "python-requests/2.31.0", "accept": "*/*", "accept-encoding": "gzip, deflate, br", "connection": "close"} }
mmalenic commented 4 months ago

Hi @cmdoret, thanks for the detailed issue.

I had a look into this, and I believe that it is expected behaviour. This is because the sample file is small, and when compressing it using BGZF, only one block is produced. Since htsget must return a valid file (i.e. byte data cannot be cut across BGZF blocks), any request with that file will always result in the full file being returned, as there is only one BGZF block to return.

This can be confirmed by creating a GZI index for the file, and reading its contents:

bgzip -r abc.bcf
hexdump -C abc.bcf.gzi

Which shows 0 entries, indicating there is only one block:

00000000  00 00 00 00 00 00 00 00                           |........|
00000008

Unfortunately, I don't think there's a nicer way to read GZI files using command line tools, although you can use noodles:

fn main() {
  let index = noodles::bgzf::gzi::read("abc.bcf.gzi").unwrap();
  println!("{}", index.len());
}

Which outputs:

1

If you try a different file with more BGZF blocks (e.g. the htsnexus_test_NA12878.bam.gzi file), it should show more blocks:

fn main() {
  let index = noodles::bgzf::gzi::read("data/bam/htsnexus_test_NA12878.bam.gzi").unwrap();
  println!("{}", index.len());
}

Which outputs:

164

In effect, this number determines the maximum amount of "slices" that can be returned by htsget for a given file.

Note, that the segments in the logs refer to the HTTP request itself, i.e. the number of segments in the path, and don't mean anything in terms of the underlying BCF file. Apologies, the logs are currently a bit noisy, which I'm aiming to improve (see #250).

I think that this issue is related to #238. This feature is currently not supported, but there are plans to implement a mode that allows the htsget server to edit out data that was not requested.

Is this a feature that you would be interested in?

cmdoret commented 4 months ago

Thanks for your explanation, @mmalenic. Indeed, this does not seem to happen on larger files (feel free to close the issue).

I understand and that makes sense, in that case we will filter on client side. As we are not yet dealing with crypt4gh, this is not a priority for us right now. But we would definitely be interested in the future, and I think this would indeed make htsget-rs very interesting for health-related projects.

I'd love to contribute, but will need to step up my rust knowledge by then :sweat_smile: