Standardize htsget:// scheme on spec

brainstorm commented 3 years ago

This was briefly mentioned in https://github.com/igvteam/igv/pull/850#issuecomment-707727642 and discussed in some of the htsget meetings, on GA4GH's Slack #fasp channel (as of today), and possibly in other issues in this repo (please link them if so). It is still unclear whether the htsget:// scheme should officially appear in the spec.

From a client perspective it'd be advantageous to discern the protocol right away but this might have other (unforeseen?) side effects?

/cc @jb-adams @ohofmann @jrobinso /cc @CastilloDel @mmalenic

andrewpatto commented 3 years ago

Hi, this was discussed today in the FASP call - and we thought we'd put some notes for further discussion:

The call agreed there was probably a need for a htsget:// scheme and that DRS could serve these up i.e there does not appear to be a natural mechanism in DRS to expose the richness of the htsget protocol by returning (for example) https:// URLs of htsget endpoint - so something calling it out as a special protocol scheme seems needed

If DRS was to serve up some form of htsget:// url, we think the client would need it (the url) to contain

the base URL (hostname?) of a htsget service
the ID of the object in question

Is there other info the url would need to contain? Does the htsget:// url also contain paths to /reads or /variants or is the concept of it specifying the htsget [base] a good one?

(there was then a discussion about the naming of the /reads and /variants endpoints - and if these are custom then how could they be discovered - at the moment even the service-info endpoints are not at a 'known' spot if custom paths are used)

jb-adams commented 3 years ago

Looping in @mlin @jmarshall @daviesrob . Please add in anyone else who should be on this thread

jrobinso commented 3 years ago

I'm not really in this loop, and am a bit confused on the status of htsget:// URIs. These are already used in the 2.24.1 (and probably earlier) release of the htsjdk, but I am inferring from the existence of this thread that they aren't standardized yet. The htsjdk recognizes (by which I mean accepts) the following as an endpoint to an alignment file:

htsget://htsget.ga4gh.org/reads/giab.NA12878.NIST7086.1

jmarshall commented 3 years ago

This was discussed in the 2020-10-27 htsget meeting. Clients like samtools given an URL like https://srv.example.com:8080/g1kdata/reads/NA12878?referenceName=chr1&start=5000&end=6000 can make that request just as they would any other HTTPS request, then detect that the response is an htsget ticket and activate their htsget machinery accordingly. So no special URL scheme is needed when the user supplies a complete (with query parameters) htsget request.

OTOH clients like IGV given a sample URL like https://srv.example.com:8080/g1kdata/reads/NA12878 and a region of interest need some way to know that they should add the referenceName/start/end parameters to make an htsget request, rather than e.g. download an index and make Range requests themselves. (The same issue would apply to a samtools invocation like samtools view https://srv.example.com:8080/g1kdata/reads/NA12878 chr1:5000-6000: if the URL is an htsget resource, samtools might like to construct an htsget request itself.)

I believe communicating that an URL will respond to htsget-style requests is the fundamental motivation for the proposed URL scheme. Otherwise each tool could have its own command line option or preferences, but an approach that was the same for all tools and didn't require a separate option would be preferable.

(Another approach that wouldn't require defining a new scheme would be to alter the htsget protocol so that all htsget requests were required to contain an htsget query parameter. Then an htsget server could respond to a request for plain https://srv.example.com:8080/g1kdata/reads/NA12878 with a distinctive error that would signal the client to retry as htsget. Using a distinctive URL scheme as proposed would have the small advantage of avoiding this extra round-trip.)

If such an URL scheme was going to be made a convention, obviously the htsget spec would be a place to do it.

As to the format and contents of the URL, what I think we were envisaging when this was discussed in the htsget meeting was that the htsget://… URL would be identical to the real URL but with htsget replacing the real scheme; hence for the above example:

htsget://srv.example.com:8080/g1kdata/reads/NA12878

(This is consistent with @jrobinso's description of what HTSJDK already supports.) If it did not contain all the parts of that URL (i.e., all of the hostname, port, and path), it would not in general be possible to reconstruct the complete URL — e.g., there is no reason a host couldn't serve multiple htsget datasets on different paths if it wanted to, so you can't drop components of the path as doing so would be ambiguous as to which dataset you were querying.

Plain http is decreasingly common these days, but it may also be necessary to define htsget+http://… and htsget+https://… as well (with htsget://… as a shorthand for most likely the latter) in order for the htsget URL to specify the HTTP(S) flavour of the connection.
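
A minimal sketch of the "straight swap" described above, in Python. The scheme names are only the ones floated in this thread (none of them is standardized), and the function name `to_transport_url` and the choice of https for the bare htsget:// shorthand are assumptions made for illustration:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping: these schemes are proposals from this thread, not part of any spec.
SCHEME_MAP = {
    "htsget": "https",        # assumed shorthand for the https flavour
    "htsget+https": "https",
    "htsget+http": "http",
}


def to_transport_url(url: str) -> str:
    """Swap an htsget-style scheme for the real transport scheme, changing nothing else."""
    parts = urlsplit(url)
    transport = SCHEME_MAP.get(parts.scheme)
    if transport is None:
        # Not one of the proposed schemes: leave the URL untouched.
        return url
    return urlunsplit((transport, parts.netloc, parts.path, parts.query, parts.fragment))


print(to_transport_url("htsget://srv.example.com:8080/g1kdata/reads/NA12878"))
```

Hostname, port, path and any query string are carried over unchanged, which is what makes the real URL recoverable from the htsget:// form.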

(Discovery and the exact path of the endpoints is a separate question. It was e.g. discussed on the htsget mailing list in February 2020, subject “htsget endpoint path is partially hardcoded” et al. In the past IIRC I have been told that Service Registry is the GA4GH solution for discovery.)

jrobinso commented 3 years ago

At the moment I'm doing the following in IGV (and igv.js) to determine if an https:// url is htsget. This is done after all other possibilities for interpreting the URL are exhausted. It's working, but feels maybe a bit fragile. I would still need to do this for an htsget:// url in order to discover the type of service (variant or reads), so at the moment the htsget:// hint, if you will, doesn't really save any calls; however, step 2 could be skipped.

(1) query the server with class=header
(2) examine the response to determine if it's an htsget server (JSON, and has an "htsget" container)
(3) examine the htsget.format variable to determine what it serves (VCF, BAM, ...)
(4) create an appropriate track

I use the class=header parameter in the initial query, even though I don't really want or need the header, to avoid accidentally requesting the entire file and having it returned in the ticket as data URIs. This would be a pathological case, unlikely.
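
The probing steps above could be sketched roughly as follows. This is not IGV's actual code: the helper names `with_header_class` and `ticket_format` are hypothetical, and the HTTP request itself is left out so that only the URL construction and the ticket check are shown.

```python
import json
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_header_class(url: str) -> str:
    """Step 1: add class=header so the probe never asks for the whole file."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "class"]
    query.append(("class", "header"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def ticket_format(body: bytes) -> Optional[str]:
    """Steps 2 and 3: return htsget.format if body is an htsget ticket.

    Returns None when the body is not a ticket, and "" when it is a ticket
    that does not state a format.
    """
    try:
        doc = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(doc, dict) or not isinstance(doc.get("htsget"), dict):
        return None
    return doc["htsget"].get("format", "")


probe_url = with_header_class("https://example.org/reads/NA12878")
sample_ticket = b'{"htsget": {"format": "BAM", "urls": []}}'
print(probe_url, ticket_format(sample_ticket))
```

A client would fetch `probe_url`, pass the response body to `ticket_format`, and create a reads or variants track (step 4) depending on the returned format.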

Re http: currently, given an htsget:// URL, I first try https://, catch errors, and if that fails try http://. I think the htsjdk does something similar, based on comments there.

jrobinso commented 3 years ago

@jmarshall With respect to the meeting notes you referenced, it looks like an IGV use case is at least partly responsible for the "htsget" idea. IGV doesn't need it; both desktop and web versions are fully implemented now, bugs and issue reports notwithstanding. The htsjdk is using it, and perhaps @lbergelson can expand on its usefulness there, but the IGV team (speaking) is neutral on this. If it is implemented, a straight swap of protocols (htsget -> https/http) with no other change in the URL is preferred, as is apparently currently the case.

jmarshall commented 3 years ago

I believe communicating [to clients] that an URL will respond to htsget-style requests is the fundamental motivation for the proposed URL scheme.

The proposal to use an htsget:// URL scheme is for the benefit of clients such as IGV and seems to have had its genesis in IGV use cases, but we have the author of IGV telling us that IGV doesn't need it. So it may be that it is unnecessary.

An alternative would be to formalise what IGV is doing. If a client makes an ordinary non-htsget HTTP(S) request for a region from a BAM/CRAM/VCF web resource (e.g. https://example.org/reads/NA12878) then it will open connections to the file and its index:

https://example.org/reads/NA12878
https://example.org/reads/NA12878.bai
https://example.org/reads/NA12878.csi
etc

in some order, and hope to make Range requests on the main file based on the contents of the downloaded index (from whichever index extension succeeded). An htsget server could signal the client to retry this as an htsget request in the usual HTTP way by returning a 426 Upgrade Required status code:

HTTP/1.1 426 Upgrade Required
Upgrade: htsget/1.3.0
Connection: Upgrade

This would be the client's signal to retry the request using htsget-style ?referenceName=…&start=…&end=… (or POST with an htsget-style JSON payload) instead of using an index.
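
A rough sketch of how a client might react to such a response, assuming the 426 behaviour proposed here (it is a proposal in this thread, not established server behaviour). The function names are hypothetical, and the region coordinates are passed through exactly as given:

```python
from typing import Mapping
from urllib.parse import urlencode


def wants_htsget_retry(status: int, headers: Mapping[str, str]) -> bool:
    """True if the response is a 426 whose Upgrade header names htsget."""
    if status != 426:
        return False
    # Header names are case-insensitive, so look the value up manually.
    upgrade = next((v for k, v in headers.items() if k.lower() == "upgrade"), "")
    # Upgrade may list several protocols, e.g. "htsget/1.3.0, other/2".
    return any(token.strip().lower().split("/")[0] == "htsget" for token in upgrade.split(","))


def region_request_url(url: str, reference_name: str, start: int, end: int) -> str:
    """Build the htsget-style GET request for a region of the same resource."""
    query = urlencode({"referenceName": reference_name, "start": start, "end": end})
    return f"{url}?{query}"


response_headers = {"Upgrade": "htsget/1.3.0", "Connection": "Upgrade"}
if wants_htsget_retry(426, response_headers):
    print(region_request_url("https://example.org/reads/NA12878", "chr1", 5000, 6000))
```

The appeal of this design is that the switch to htsget stays between client and server: the user supplies an ordinary https URL and never sees the retry.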

An htsget server certainly should return 426 for requests for index filenames (e.g. URLs with spurious .bai, .csi, etc at the end). Whether to return 426 or an htsget ticket for a plain request for a valid resource is likely to be server-dependent. A case could be made for returning a ticket representing the entire file (which would also signal htsget, but be susceptible to the pathological case that @jrobinso mentioned); or a case could be made that the point of htsget is to make requests for subsets of files and refuse to return the whole file by returning 426 instead. (We could define ?class=all to be an explicit htsget-style request for the whole file if that was useful.)

IMHO this is a better approach than a bespoke URL scheme because it keeps the details of the shift to htsget between the client and server. Having an htsget:// scheme would mean that everyone would need to be aware of it and remember to use it — whereas with 426 status codes it is automatic and internal.

jmarshall commented 3 years ago

https://github.com/samtools/hts-specs/issues/581#issuecomment-882949713 by @andrewpatto:

The [FASP] call agreed there was probably a need for a htsget:// scheme and that DRS could serve these up i.e there does not appear to be a natural mechanism in DRS to expose the richness of the htsget protocol by returning (for example) https:// URLs of htsget endpoint - so something calling it out as a special protocol scheme seems needed

Why does DRS need to expose the richness of the htsget protocol?

What issue would there be with serving up an https URL from DRS, and only finding out later when the client went to access it that the URL speaks htsget?

brainstorm commented 3 years ago

(...) seems to have had its genesis in IGV use cases (...)

Not sure if that's accurate. IIRC htsjdk had that htsget:// scheme before discussing and working for htsget support on IGV-desktop.

jmarshall commented 3 years ago

That's why I wrote “seems”; moreover this was a reference to this previous comment: https://github.com/samtools/hts-specs/issues/581#issuecomment-883894364.

HTSJDK introduced this in its client implementation in samtools/htsjdk#1494, which didn't explain whether or in what way the scheme was necessary in their implementation. Perhaps @andersleung or @lbergelson can shed some light on this.

Anyway: I withdraw any comment about the genesis of this scheme proposal, and it's immaterial in any case. I would be interested in your and others' views on the substance of the counterproposal.

jrobinso commented 3 years ago

I like the 426 response proposal; it would clean up the client code. IGV, and I imagine other clients, read the index first, and there are 3 patterns that need to be tried: ".bam.bai", ".bai", and ".bam.csi". The first 2 exist because the htsjdk and samtools use different naming conventions for indexes. To be honest I don't recall if ".csi" is also tried, but the point is that any of the common patterns for indexes would ideally return the 426, so the first one succeeds and we don't cycle through all of them.

lbergelson commented 3 years ago

The origin of the htsget:// scheme comes from GATK as a way of specifying on the command line what the type of the input is. It's useful for us to be able to determine what sort of datastore we're going to be reading from before we reach out and poke anything. It's not impossible to change to follow redirects or something like that, but this seems simple, and when failures occur we understand the user's intent so we can write a clear error message.

We're willing to adapt to an alternative mechanism like what @jmarshall proposed if that's the consensus.

jrobinso commented 3 years ago

@jmarshall, on reflection, due to the necessity of knowing what is being served, I'm not sure the 426 will in the end save anything. The client (e.g. IGV) just has a URL to start with; it's not known if it's a URL to a "bam" source until the initial server poke. By the time the data type is known, it's also known whether it's an htsget server or not, so I can't imagine a situation where an attempt is made to fetch an index.

jb-adams commented 3 years ago

From @jmarshall

Why does DRS need to expose the richness of the htsget protocol?

What issue would there be with serving up an https URL from DRS, and only finding out later when the client went to access it that the URL speaks htsget?

The url property of the AccessURL object in DRS is a direct URL for fetching the bytes of the object in question. A client would expect that when they hit this URL, they receive the contents of the file. If a DRS service was to provide a URL to an htsget object without any clear indication, the client would hit the endpoint and might mistakenly think the htsget ticket was in fact the file contents. I suppose a DRS client could be made to scan the first set of bytes to determine if the response is an htsget ticket, but it might be cleaner for the DRS service to provide that info up front (e.g. type: htsget), thereby letting the client know to perform follow up API calls after receiving the ticket.

lbergelson commented 3 years ago

I think it's useful for a client to be able to identify what type of object it's accessing without making any web requests. The htsget:// scheme satisfies that, but adding a distinctive query parameter would also work, like @jmarshall suggested earlier.

jmarshall commented 3 years ago

[@jrobinso]

I'm curious, because I faced this with IGV, htsget:// alone doesn't tell me if the source is serving variants or alignments, so I still have to poke it to know what sort of reader object to instantiate […] @jmarshall, on reflection, due to the necessity of knowing what is being served, I'm not sure the 426 will in the end save anything. The client (e.g. IGV) just has a URL to start with; it's not known if it's a URL to a "bam" source until the initial server poke.

This is a foible of IGV's implementation. For htslib for example, detecting that the response is an htsget ticket is done in the file access layer, before any decisions have been made that need to know whether the stream is going to be used as a samFile or a vcfFile.

For any URL, you don't know for sure if it's going to be a “bam” or a “vcf” or another kind of source until the initial server poke. So I guess I'm a little surprised that in general you don't do the initial server poke (to get a Content-Type or to sniff the first few bytes to detect formats) before you instantiate different kinds of reader objects. I assume for most direct URLs, you do some heuristics based on the extension at the end of the URL? For htsget you could check for the URL path containing …/reads/… or …/variants/… and that heuristic would be right 90% of the time… but you don't know for sure in advance that an arbitrary URL is going to be an htsget URL.

[@jb-adams]

The url property of the AccessURL object in DRS is a direct URL for fetching the bytes of the object in question. A client would expect that when they hit this URL, they receive the contents of the file.

This is quite an extraordinary statement for DRS to make. The nature of the web is that you can make likely inferences from the structure of URLs, but you don't know for sure what you're going to get until you make the request and receive a response. In particular, an URL may result in a redirect and the client is expected to make a request to another URL as specified. Are you saying that AccessURL prohibits redirects?

An htsget ticket is really a form of redirection.* What I hear you saying is that DRS would like the https://example.org/reads/NA12878 URL to result in a ticket rather than 426. Or at least, that there be a form of the URL that DRS can use for AccessURL that results in a ticket.

(* ISTR we briefly mused about using a 3xx status code for the ticket response, but considered that this could make the protocol harder or impossible to implement using some HTTP client libraries — which might be expecting to handle all 3xx responses themselves.)

[@lbergelson]

I think it's useful for a client to be able to identify what type of object it's accessing without making any web requests. The htsget:// scheme satisfies that, but adding a distinctive query parameter would also work, like @jmarshall suggested earlier.

My problem with htsget:// is (1) it's a hack; (2) it hides the http/https/etc transport; (3) it's unnecessary, and users shouldn't need to be aware of it; (4) it doesn't get you any closer to the raw bytes, so doesn't really solve DRS's problem.

If people want an optional indication that an URL is likely to be an htsget URL, an optional distinctive query parameter is also a hack but a lesser one and provides a useful optional heuristic. e.g.,

https://example.org/reads/NA12878?htsgetdatatype=reads

could be used by DRS and others if they wished, is distinctive, could be validated by htsget servers, and solves @jrobinso's problem.
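
A minimal sketch of how a client could use such a hint without making any web request. The `htsgetdatatype` parameter is only a suggestion from this thread, and the function name and the set of accepted values (reads, variants) are assumptions:

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed set of valid values, mirroring the /reads and /variants endpoints.
KNOWN_DATATYPES = ("reads", "variants")


def htsget_datatype_hint(url: str) -> Optional[str]:
    """Return the htsgetdatatype hint carried by the URL, or None if absent or unknown."""
    values = parse_qs(urlsplit(url).query).get("htsgetdatatype")
    if not values:
        return None
    value = values[-1]  # if repeated, the last occurrence wins (an arbitrary choice)
    return value if value in KNOWN_DATATYPES else None


print(htsget_datatype_hint("https://example.org/reads/NA12878?htsgetdatatype=reads"))
```

Because the hint is optional, a None result only means "no hint given"; the URL may still turn out to be an htsget resource once it is requested.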

jrobinso commented 3 years ago

@jmarshall Maybe it's a "foible", but IGV supports ~45 file formats plus a few web services, and in most cases it would not be possible to detect the file format even after reading the entire file. So yes, it insists on some conventions if you want to view your data there, in most cases the file extension. I'm able to make an exception for htsget because it returns a defined JSON container (the ticket) with a format specifier, so we can know what it serves. If anything is missing here, and this is minor, it's a call that says "yes, I'm an htsget server and this is what I'm serving". I am using "class=header" for that now, which works well enough.

jmarshall commented 3 years ago

If anything is missing here, and this is minor, it's a call that says yes I'm an htsget server and this is what I'm serving

The intention would be that service-info satisfies this need. (It's a relatively recent addition to the htsget spec so extant servers may or may not support it yet.)

Given an URL like https://example.org/reads/NA12878, you would trim the /<id> part of the path off and replace it with /service-info to make a request to https://example.org/reads/service-info. If it's not an htsget or other GA4GH-style server, you'll get a 404 or at least something that's not service-info JSON. (Or if it's an htsget server that doesn't implement service-info, that'd be a 404 too…)

However in htsget the <id> part is not constrained to be a single path segment. So for an URL like https://example.org/pub/reads/IGSR/g1k/NA12878 you would need to trim a segment at a time and try each in turn:

https://example.org/pub/reads/IGSR/g1k/service-info
https://example.org/pub/reads/IGSR/service-info
https://example.org/pub/reads/service-info

or perhaps use some heuristic around trimming back to …/reads (for the usual case in which the exact text either …/reads/… or …/variants/… appears). This is not ideal, and it's not obvious how clients are intended to construct a service-info URL from an arbitrary URL on that service (or indeed that they are intended to do so at all).
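
The segment-by-segment trimming described above can be sketched as follows. The function name is hypothetical, and this version simply keeps trimming down to the server root, which goes beyond the three candidates listed above; a real client might stop at …/reads or …/variants instead.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit


def service_info_candidates(url: str) -> List[str]:
    """List service-info URLs to try, dropping one trailing path segment at a time."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    candidates = []
    # keep = number of leading path segments retained before /service-info
    for keep in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:keep] + ["service-info"])
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for candidate in service_info_candidates("https://example.org/pub/reads/IGSR/g1k/NA12878"):
    print(candidate)
```

Each candidate costs a request, so the number of probes grows with the depth of the id, which is part of why this approach is described as not ideal.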

ianfore commented 3 years ago

@andrewpatto wrote

The call agreed there was probably a need for a htsget:// scheme and that DRS could serve these up i.e there does not appear to be a natural mechanism in DRS to expose the richness of the htsget protocol by returning (for example) https:// URLs of htsget endpoint - so something calling it out as a special protocol scheme seems needed

If DRS was to serve up some form of htsget:// url, we think the client would need it (the url) to contain

the base URL (hostname?) of a htsget service

the ID of the object in question

Is there other info the url would need to contain? Does the htsget:// url also contain paths to /reads or /variants or is the concept of it specifying the htsget [base] a good one?

(there was then a discussion about the naming of the /reads and /variants endpoints - and if these are custom then how could they be discovered - at the moment even the service-info endpoints are not at a 'known' spot if custom paths are used)

Some thoughts about how DRS and htsget interact

It's not tractable for DRS to make provision or account for the type/protocol of everything it may serve as a payload. This is another variant of the issues we have discussed with DICOM and whether DRS can/should reflect the structure (model) of a specific data type. Besides htsget's ability to retrieve specific regions, the ability to access /reads or /variants requires knowledge of the specific datatype being handled. DRS is not the protocol to provide reach-in to the specifics of the objects it carries. Improvement of how type is indicated in DRS would deal with that. It is also likely that the reads and variants should be specific objects with their own DRS ids.

That said, as an example we have demonstrations that the URL provided by DRS can be passed to SAMTools. That could be used to provide very similar functionality to htsget. The difference as I understand it is, for htsget, the slicing of the file would be done on the htsget server. Using the DRS url the slicing would be on the WES server where you are running samtools. In theory samtools would have to retrieve the whole file. However, the widely expected behaviour is that the compute (samtools) would be run in the same cloud region as the file - so no download occurs. Properly organized, there should be minimal net difference in performance. The difference is really one of convenience for the user. DRS and WES provide generic capability for you to roll your own solution to many problems. htsget provides specific capability for given datatypes more simply packaged.

There's also a mismatch on ids. A DRS id will always give you the same set of bytes. That's a fundamental of DRS intended to address reproducibility etc. The accessions used in the htsget examples (e.g. NAxxxxxx) wouldn't consistently give the same set of bytes. That's not to say it wouldn't be useful to be able to use DRS ids with htsget to refer to the same binary data. That could be separated from the use of the DRS protocol to access the file.

andrewpatto commented 3 years ago

#581 (comment) by @andrewpatto:

The [FASP] call agreed there was probably a need for a htsget:// scheme and that DRS could serve these up i.e there does not appear to be a natural mechanism in DRS to expose the richness of the htsget protocol by returning (for example) https:// URLs of htsget endpoint - so something calling it out as a special protocol scheme seems needed

Why does DRS need to expose the richness of the htsget protocol?

What issue would there be with serving up an https URL from DRS, and only finding out later when the client went to access it that the URL speaks htsget?

Sorry for the very late reply to this.

I don't fundamentally disagree with anything in the thread - but I guess can make some comments where my thinking from a DRS perspective might provide some different arguments.

DRS wouldn't be providing specific locations for access if it was to serve up htsget URLs - so no referenceName/start/end - so anything relying on the presence of those in the URLs to 'know' htsget doesn't really work from the DRS use case (addressed possibly by adding eg htsgetdatatype=reads for the purposes of discovery)
Absent a marker in the URL itself, it points to the need for some sort of htsget 'discovery' mechanism as described above for what IGV and others do - start with a range or HEAD request and detect mime types etc - before then engaging the htsget protocols as a client
In the DRS case though, the DRS server might be serving up large numbers of these links where the DRS server itself knows that htsget is the mechanism (overlaying https) it wants the client to use. So currently, without an htsget URI format, it's "here are 100 https links - btw check each of them before you start to treat them as htsget". I think the DRS thinking is that it would be good to be able to say instead "here are 100 links you should access via htsget". (Understanding that obviously if the client received back a JPG image, it would have to abort. It is true you can never know exactly what will be returned until it comes back, but there is a difference in the initial assumptions you might make for the pattern of use.)

So I totally understand that it might not be seen as a big enough issue to warrant an htsget URI format (following the above thread, I'm not even convinced myself). Just putting some of the arguments out there.

andrewpatto commented 3 years ago

I would also add, though, that whilst I totally agree that htsget can in some way be viewed as being just http, I also think that it has a clearly defined 'pattern of use' of http that is unique to it. That is, the custom 'known' parameter names like referenceName etc., the expected responses by the server, and the custom way that response is then 'followed' to get to the data.

So is it a protocol layered on top of another protocol?

https+htsget:// ??

(I'm sure there is some RFC that says this is a bad idea..)

brainstorm commented 3 years ago

I think that the bottom line here should be to either include this scheme in the standard or explicitly discourage it, since implementations are starting to diverge, which might cause different types of (integration) troubles in the (near) future.

brainstorm commented 2 years ago

As agreed in today's htsget APAC-friendly meeting, I've been tasked to summarise this thread in the following table and then I also added some more pros/cons to the mix.

Full disclosure: While I was quite partial to the htsget:// scheme, now I see some of its drawbacks while constructing this table.

Please do help/comment in improving this analysis if you see things that are unclear/malformed/biased in any way.

| https:// (status quo): pros | https:// (status quo): cons | ?htsgetdatatype=reads: pros | ?htsgetdatatype=reads: cons | https+htsget://: pros | https+htsget://: cons | htsget://: pros | htsget://: cons |
| --- | --- | --- | --- | --- | --- | --- | --- |
| no breaking spec changes | more involved url handling client code | no breaking spec changes | yet another parameter to handle | follows popular https+git://-like scheme(s) | not recognised by IANA/IETF | own scheme helps with htsget adoption | "a hack???" (as seen by @jmarshall) |
| - | does not scale well with multiple urls | early hint for clients | - | early hint for clients | - | adopted by htsjdk already | if done properly, we'd need to follow RFC8615 |
| - | more guesswork needed from clients | less guesswork needed from clients | additional code to check for datatype | less guesswork needed from clients | might break existing code | less guesswork needed from clients | might break existing code |
| - | more client checks and requests | less client checks and requests | requires changes to htsget spec | less client checks and requests | requires changes to htsget spec | less client checks and requests | requires changes to htsget spec |

brainstorm commented 2 years ago

Closing as discussion seems stalled and @jmarshall is going for HTTP 426 response (PR #665) anyway.

jmarshall commented 2 years ago

It is true that discussion has stalled, partly because until yesterday we had not had an htsget meeting in quite a while. This issue took up most of the time of yesterday's meeting and I think there was general agreement that this is htsget's main open question at the moment and we will endeavour to get discussion rolling again.

The proposed HTTP 426 response can provide an alternative identification mechanism but (as explained on the PR and in yesterday's meeting) it is really orthogonal to the question of whether to use or specify a bespoke URL scheme or distinctive query parameter, as is being considered on this issue. Even if htsget does bless self-identifying URLs via a scheme or query parameter, the defined 426 response for servers to use for index file requests may be a useful thing to have in the spec as well. So considering PR #665 does not mean that this issue's question has been decided.

samtools / hts-specs

Standardize htsget:// scheme on spec #581