Closed jrobinso closed 1 year ago
Just a comment since I know some of you participate in spec meetings, there doesn't seem to be any way to ask a server what output formats it supports for a given ID. I realize there are only 2 possible for reads, but that is more than 1 and clients should be able to know before requesting a particular format.
I think some servers could handle either BAM or CRAM output for the same dataset ID, but other cannot (such as this particular test server).
I think that this issue stems from the following S3 path confusion, so I just went and created a mixed
S3 subkey where you have BAM and CRAM together with the same ID:
(base) rvalls@m1 ~ % aws s3 ls s3://org.umccr.demo.htsget-rs-data/
PRE bam/
PRE bcf/
PRE cram/
PRE crypt4gh/
PRE events/
PRE vcf/
(base) rvalls@m1 ~ % aws s3 ls s3://org.umccr.demo.htsget-rs-data/cram/
2023-09-25 16:26:51 1627794 htsnexus_test_NA12878.cram
2023-09-25 16:26:51 129 htsnexus_test_NA12878.cram.crai
2023-09-28 15:32:18 693771 mt.sorted_cram.cram
2023-09-28 15:32:26 320 mt.sorted_cram.cram.crai
(base) rvalls@m1 ~ % aws s3 ls s3://org.umccr.demo.htsget-rs-data/bam/
2023-09-25 16:26:51 2596799 htsnexus_test_NA12878.bam
2023-09-25 16:26:51 6728 htsnexus_test_NA12878.bam.bai
2023-09-25 16:26:51 2034 htsnexus_test_NA12878.bam.blocks.yaml
2023-09-25 16:26:51 2616 htsnexus_test_NA12878.bam.gzi
2023-09-28 15:02:23 876871 mt.sorted.bam
2023-09-28 15:02:29 747632 mt.sorted.bam.bai
(base) rvalls@m1 ~ % aws s3 cp s3://org.umccr.demo.htsget-rs-data/bam/mt.sorted.bam s3://org.umccr.demo.htsget-rs-data/mixed/mt.sorted.bam
copy: s3://org.umccr.demo.htsget-rs-data/bam/mt.sorted.bam to s3://org.umccr.demo.htsget-rs-data/mixed/mt.sorted.bam
(base) rvalls@m1 ~ % aws s3 cp s3://org.umccr.demo.htsget-rs-data/bam/mt.sorted.bam.bai s3://org.umccr.demo.htsget-rs-data/mixed/mt.sorted.bam.bai
copy: s3://org.umccr.demo.htsget-rs-data/bam/mt.sorted.bam.bai to s3://org.umccr.demo.htsget-rs-data/mixed/mt.sorted.bam.bai
(base) rvalls@m1 ~ % aws s3 cp s3://org.umccr.demo.htsget-rs-data/cram/mt.sorted_cram.cram s3://org.umccr.demo.htsget-rs-data/mixed/mt.sorted.cram
copy: s3://org.umccr.demo.htsget-rs-data/cram/mt.sorted_cram.cram to s3://org.umccr.demo.htsget-rs-data/mixed/mt.sorted.cram
(base) rvalls@m1 ~ % aws s3 cp s3://org.umccr.demo.htsget-rs-data/cram/mt.sorted_cram.cram.crai s3://org.umccr.demo.htsget-rs-data/mixed/mt.sorted.cram.crai
copy: s3://org.umccr.demo.htsget-rs-data/cram/mt.sorted_cram.cram.crai to s3://org.umccr.demo.htsget-rs-data/mixed/mt.sorted.cram.crai
(base) rvalls@m1 ~ % aws s3 ls s3://org.umccr.demo.htsget-rs-data/mixed/
2023-10-02 14:01:15 876871 mt.sorted.bam
2023-10-02 14:01:28 747632 mt.sorted.bam.bai
2023-10-02 14:02:25 693771 mt.sorted.cram
2023-10-02 14:02:36 320 mt.sorted.cram.crai
You'll see that our server behaves as expected in this situation:
curl 'https://htsget.demo.umccr.org/reads/org.umccr.demo.htsget-rs-data/mixed/mt.sorted?class=header' <--- Gets the BAM format by default, as per spec
curl 'https://htsget.demo.umccr.org/reads/org.umccr.demo.htsget-rs-data/mixed/mt.sorted?class=header&format=CRAM'
And to your point:
there doesn't seem to be any way to ask a server what output formats it supports for a given ID
From my point of view, the UnsupportedFormat
is reserved to formats that the (htsget) server does not support, as it only supports the subset {VCF,BCF,BAM,CRAM}
at the time of this writing , i.e, BED
is not supported at the moment:
% curl 'https://htsget.demo.umccr.org/reads/org.umccr.demo.htsget-rs-data/mixed/mt.sorted?class=header&format=BED'
{
"htsget": {
"error": "UnsupportedFormat",
"message": "bed isn't a supported format for this endpoint"
}
}
If I understand correctly, you'd want a hint for clients when an ID with multiple formats is found within the storage backend? As in, some draft behavior like the following?:
curl 'https://htsget.demo.umccr.org/reads/org.umccr.demo.htsget-rs-data/mixed/mt.sorted?class=header
{
"htsget": {
"message": "Multiple formats detected for the supplied ID, returning BAM by default",
"formats": ["BAM, CRAM"]
}
}
And then have a HTTP 302 or 303 code to redirect the client to BAM automatically as it does now by default? That way clients would have that hint and not break backwards compat?
Happy to propose it on hts spec if that's what you need/want, @jrobinso?
I agree with @brainstorm here. I'm not sure that UnsupportedFormat
is the correct error response because my interpretation of "The server SHOULD reply with an UnsupportedFormat error if the requested format is not supported" was for the server as a whole, rather than for a specific id. Although it could be interpreted both ways, I think NotFound
probably makes more literal sense because that key, e.g. cram/mt.sorted_cram.bam
, could not be found.
I also think this interpretation (the currently implemented one) aligns more closely with the HTTP status codes that UnsupportedFormat
and NotFound
map to. It's not that a request such as the example is malformed (indicating a 400 response), but rather that the underlying resource, in this case cram/mt.sorted_cram.bam
does not exist.
Either way, I can see the utility of having a hint for the formats that are available for a particular id. This doesn't necessarily need a spec change either, and it could just be present in the error response (whether that is UnsupportedFormat
or NotFound
), e.g.:
{
"htsget": {
"error": "NotFound",
"message": "id with key: `cram/mt.sorted_cram.bam` not found. The following formats are available for this id: ['BAM']"
}
}
Somewhat related to this (although not directly) is https://github.com/umccr/htsget-rs/issues/127.
This doesn't necessarily need a spec change either, and it could just be present in the error response (whether that is UnsupportedFormat or NotFound)
Well, I slightly disagree here: I'm not sure that message
response is the best place to embed operational hints that clients might use and rely upon... i.e: you'd have to keep that message stable for new upcoming releases and it sounds like a future breakage site :-S
Yeah that's a completely valid point. A hint would probably be best as an additional JSON field, which would require a spec change.
OK I will defer to you guys, I misinterpreted the spec. For my part I've updated the igv.js doc to be clear that CRAM and BCF are not supported for htsget. This is also true of IGV desktop as the htsjdk does not support them.
The following request
results in the following response
According to the spec
The server SHOULD reply with an UnsupportedFormat error if the requested format is not supported.
I realize it says SHOULD so this behavior isn't technically out of spec, but it can result in confusing error messages to client programs and their users. It doesn't appear that htsjdk supports decoding CRAM from htsget responses, so I was testing to see what happens if BAM is requested. I think some servers could handle either BAM or CRAM output for the same dataset ID, but other cannot (such as this particular test server). So its difficult to present an informative error message to end users without the specific error called for in the spec.