Pass `Accept` header in `contrib.utils.download`

GraemeWatt commented 3 years ago

I'm copying a comment here that I made in the HEPData Zulip chat on 16th October 2020.

Regarding the issue (HEPData/hepdata#162) to mint DOIs for all local resource files attached to a submission, if we do eventually get around to addressing it, we would probably redirect the DOI to a landing page for the resource file, rather than to the resource file itself (e.g. the pyhf tarball). This would follow the DataCite Best Practices for DOI Landing Pages, e.g. "DOIs should resolve to a landing page, not directly to the content", which I'm currently breaking for the two manually minted DOIs. In the issue (HEPData/hepdata#162) I mentioned the possibility of using DataCite Content Negotiation to redirect to the resource file itself, but the linked page now says "Custom content types are no longer supported since January 1st, 2020". I thought maybe content negotiation could be used to return the .tar.gz file directly, but the intended purpose is to retrieve DOI metadata in different formats, not to provide the content itself. In anticipation of possible future changes, I'd recommend that you use the URL directly rather than the DOI in pyhf download scripts and documentation (e.g. revert #1109).

matthewfeickert commented 3 years ago

Thanks @GraemeWatt. This is important, so we're quite happy you're bringing this (back) to our attention.

So, from my understanding of what you've shown here, the recommendation from DataCite is that "DOIs should resolve to a landing page, not directly to the content", that "The DOI should be appropriately tagged (so that machines can read it)", and "can retrieve additional information about the item that might not be easily retrievable from the item itself." But if, as you've said, there's no way to get access to the actual data products associated with that particular DOI, then I guess I'm not clear on what purpose the DOI has if it is just the metadata.

The section "The landing page should provide a way to access the item", which reads

> Humans should be able to reach the item being described from the landing page. If the item has been removed, retracted, or otherwise made unavailable to the public on purpose, the landing page should serve as a "tombstone page", providing enough information that the item can be identified and confirmed to have existed at one time.

only makes explicit mention of humans, as opposed to humans and machines. So does this mean that DOIs are becoming human use only and that accessing a data product associated with a DOI is necessarily a two-step process (get the DOI and then, from the DOI landing page, the data product download URL)?

I am perhaps missing something obvious about all of this. If so, if you have an explicit example that would be great to see.

danielskatz commented 3 years ago

Hey @mfenner - can you help here?

I think it should be possible to programmatically query the DOI and get the location of the underlying object, then fetch it.

Is this correct? Is there any code available that demonstrates this?

matthewfeickert commented 3 years ago

Just wanted to follow up on this if @mfenner has time for input. Any thoughts here are appreciated!

mfenner commented 3 years ago

Unfortunately DOIs routinely point to landing pages and not the content, as mentioned in the comments above. There are a number of reasons why this makes sense, e.g. access restrictions and different file formats, but that makes automated machine access very hard. A new DOI metadata field contentURL is therefore on the list of improvements for the next DataCite metadata schema, which is planned for release in 12-18 months.

Metadata are specific to each DOI registration agency, so these things might work slightly differently for Crossref or any of the other DOI registration agencies.

If schema.org metadata are available (via the landing page), one can use the schema.org contentUrl property.

GraemeWatt commented 2 years ago

I've been investigating three options to directly return content (i.e. the pyhf tarball) from the DOI after we mint DOIs for local resource files with URLs directing to a landing page rather than the resource file itself (see HEPData/hepdata#162).

1. Following the suggestion of @mfenner, we could embed Schema.org metadata on the HEPData landing page for the resource file in JSON-LD format (see HEPData/hepdata#145) including a contentUrl property. One problem is that doing `curl -LH "Accept: application/vnd.schemaorg.ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` or `curl -LH "Accept: application/ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` returns JSON-LD from DataCite (without contentUrl) using DataCite Content Negotiation before getting to the HEPData server. I think we would need to introduce a custom metadata content type like `curl -LH "Accept: application/vnd.hepdata.ld+json" https://doi.org/10.17182/hepdata.89408.v1/r2` to return the JSON-LD from the HEPData landing page. The pyhf code would then parse the contentUrl and make the download in another request.
2. DataCite offers a media API where custom content types can be registered and then later retrieved via a public REST API, although content negotiation is no longer supported. However, it should be possible to retrieve the metadata via, for example, https://api.datacite.org/dois/10.17182/hepdata.89408.v1/r2 and then parse the media to find the registered URL of the content for a specific media type like application/x-tar. I tried to test the DataCite media API by registering a custom content type for one DOI, but it doesn't seem to be working. I reported the problems I found to DataCite support, but I don't think the media API option is worth pursuing further.
3. A 2019 blog article by @mfenner mentions an alternative option to "use content negotiation at the landing page for the resource that the DOI resolves to. DataCite content negotiation is forwarding all requests with unknown content types to the URL registered in the handle system." This seems like the simplest option for the pyhf use case. The HEPData landing page for the resource file can check if the Accept request HTTP header matches the content type of the resource file and return the content directly if so, for example, `curl -LH "Accept: application/x-tar" https://doi.org/10.17182/hepdata.89408.v1/r2`. In the pyhf Python code, you'd just need to replace this line: https://github.com/scikit-hep/pyhf/blob/260315d2930b38258ad4c0718b0274c9eca2e6d4/src/pyhf/contrib/utils.py#L56 with:
```
    with requests.get(archive_url, headers={'Accept': 'application/x-tar'}) as response:
```
Some other suggestions for improvements to this code:
- Check the response.status_code and return an error message if not OK.
- Use tarfile.is_tarfile to check that response.content is actually a tarball and return an error message if not.
- Remove mode="r|gz" or replace it with mode="r" or mode="r:*" for reading with transparent compression, so that the code works also with uncompressed tarballs (see #1111 and #1519), where the media type is still application/x-tar.
- Maybe add an option to download a zipfile instead of a tarball (see #1519), then you'd need headers={'Accept': 'application/zip'} in the request and zipfile.is_zipfile to check the response content. You could use the Python zipfile module to unpack, but maybe easier to use shutil.unpack_archive for both tarballs and zipfiles.
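
To make the first three suggestions concrete, here is a minimal sketch of how they might fit together. It is not the pyhf implementation: the function names `unpack_tarball` and `download` are hypothetical, the error types are arbitrary choices, and passing a file object to `tarfile.is_tarfile` assumes Python 3.9 or later.

```python
import io
import tarfile


def unpack_tarball(content: bytes, output_directory: str) -> None:
    """Check that `content` is a tar archive, then extract it.

    Hypothetical helper for illustration only.
    """
    # is_tarfile accepts file-like objects from Python 3.9 onwards.
    if not tarfile.is_tarfile(io.BytesIO(content)):
        raise ValueError("The downloaded content is not a tarball.")
    # mode="r:*" reads with transparent compression, so both compressed
    # and uncompressed tarballs are handled. Only extract archives from
    # sources you trust.
    with tarfile.open(fileobj=io.BytesIO(content), mode="r:*") as archive:
        archive.extractall(output_directory)


def download(archive_url: str, output_directory: str) -> None:
    """Request the tarball behind `archive_url` and unpack it.

    Hypothetical helper for illustration only.
    """
    # Imported here so unpack_tarball can be used without requests installed.
    import requests

    headers = {"Accept": "application/x-tar"}
    with requests.get(archive_url, headers=headers) as response:
        if response.status_code != 200:
            raise RuntimeError(
                f"{archive_url} returned status code {response.status_code}"
            )
        unpack_tarball(response.content, output_directory)
```

With the `is_tarfile` check in place, an HTML landing page returned by mistake would raise a clear error instead of failing somewhere inside the extraction.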

Making these changes should not break the functionality with the current situation (where https://doi.org/10.17182/hepdata.89408.v1/r2 returns the tarball directly). I'd therefore recommend you make them ASAP before the next pyhf release. After we redirect the DOI to the landing page, probably in the next few weeks, the DOI will return the HTML landing page instead of the tarball unless the request contains the Accept: application/x-tar header.
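
For illustration, the decision the landing page would have to make can be sketched as below. This is only a guess at the logic, not the HEPData implementation: the function name `wants_file` is hypothetical, and the sketch ignores quality values (`q=`) and treats wildcards such as `*/*` as a request for the HTML landing page.

```python
def wants_file(accept_header: str, file_media_type: str) -> bool:
    """Return True if the Accept header explicitly lists the file's media type.

    Hypothetical helper: wildcards and q-values are deliberately not honoured.
    """
    for item in accept_header.split(","):
        # Drop parameters such as ";q=0.8" and normalise case and whitespace.
        media_type = item.split(";")[0].strip().lower()
        if media_type == file_media_type.lower():
            return True
    return False


# A request carrying the explicit header would get the tarball ...
print(wants_file("application/x-tar", "application/x-tar"))
# ... while a typical browser header would get the landing page.
print(wants_file("text/html,application/xhtml+xml,*/*;q=0.8", "application/x-tar"))
```

Under these assumptions a plain `curl -L` request, which sends `Accept: */*`, would also receive the landing page, which matches the behaviour described above.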

mfenner commented 2 years ago

I agree with your analysis. The DataCite media API was deprecated as it doesn't really fit into the outlined model. And content negotiation for application/ld+json unfortunately triggers the DataCite content negotiation.

kratsg commented 2 years ago

(Very nice analysis). One slight concern I have with this is that the HistFactory JSON should not be treated as the only kind of JSON-like item that would be uploaded to HEPData -- is this taking into account a way to request a particular item as such, or would this be downloading all JSON items in a record?

GraemeWatt commented 2 years ago

@kratsg, it seems you misunderstood, so let me try to clarify. Solutions 1. to 3. above are ways to download a resource file (e.g. a pyhf tarball) from HEPData given the DOI. Solution 2. doesn't work, so you should concentrate on solution 3. which is simpler than solution 1.

The Schema.org JSON-LD referred to in solution 1. is a way of embedding metadata in a web page so it can be indexed by search engines (see "Understand how structured data works" from Google). This has nothing to do with the pyhf JSON format (apart from obviously being JSON-based)! Until we had solution 3., the proposal in solution 1. was that we could add a field contentUrl to the embedded metadata, then you could retrieve the JSON-LD from the landing page for the resource file to find the download link given the DOI. But you don't need to worry about this now that solution 3. has been developed. We'll still make solution 1. available as it might be helpful for other use cases, it enables indexing by search engines, and it is an open issue from 2018 to upgrade to JSON-LD from the older Microdata format currently embedded in HEPData web pages.
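
As a rough sketch of what retrieving the download link in solution 1. could look like on the client side, the snippet below pulls JSON-LD out of a landing page's HTML and reads a top-level `contentUrl`. The names `JSONLDParser` and `find_content_url` are hypothetical, and it assumes the metadata sits in a `<script type="application/ld+json">` element, which is the usual way to embed JSON-LD but is not confirmed here for HEPData.

```python
import json
from html.parser import HTMLParser


class JSONLDParser(HTMLParser):
    """Collect the JSON-LD documents embedded in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_json_ld = False


def find_content_url(html: str):
    """Return the first top-level contentUrl found, or None."""
    parser = JSONLDParser()
    parser.feed(html)
    for document in parser.documents:
        if isinstance(document, dict) and "contentUrl" in document:
            return document["contentUrl"]
    return None
```

The returned URL would then be fetched in a second request, which is the extra step that solution 3. avoids.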

matthewfeickert commented 2 years ago

> A 2019 blog article by @mfenner mentions an alternative option to "use content negotiation at the landing page for the resource that the DOI resolves to. DataCite content negotiation is forwarding all requests with unknown content types to the URL registered in the handle system." This seems like the simplest option for the pyhf use case. The HEPData landing page for the resource file can check if the Accept request HTTP header matches the content type of the resource file and return the content directly if so, for example, curl -LH "Accept: application/x-tar" https://doi.org/10.17182/hepdata.89408.v1/r2. In the pyhf Python code, you'd just need to replace this line: https://github.com/scikit-hep/pyhf/blob/260315d2930b38258ad4c0718b0274c9eca2e6d4/src/pyhf/contrib/utils.py#L56
>
> with:
>
>     with requests.get(archive_url, headers={'Accept': 'application/x-tar'}) as response:

Thanks for this excellent analysis and summary @GraemeWatt — truly appreciated! :rocket: I'll get this in right away and then we can make additional improvements.

> Some other suggestions for improvements to this code:
>
> - Check the response.status_code and return an error message if not OK.
> - Use tarfile.is_tarfile to check that response.content is actually a tarball and return an error message if not.
> - Remove mode="r|gz" or replace it with mode="r" or mode="r:*" for reading with transparent compression, so that the code works also with uncompressed tarballs (see #1111 and #1519), where the media type is still application/x-tar.
> - Maybe add an option to download a zipfile instead of a tarball (see #1519), then you'd need headers={'Accept': 'application/zip'} in the request and zipfile.is_zipfile to check the response content. You could use the Python zipfile module to unpack, but maybe easier to use shutil.unpack_archive for both tarballs and zipfiles.
>
> Making these changes should not break the functionality with the current situation (where https://doi.org/10.17182/hepdata.89408.v1/r2 returns the tarball directly). I'd therefore recommend you make them ASAP before the next pyhf release. After we redirect the DOI to the landing page, probably in the next few weeks, the DOI will return the HTML landing page instead of the tarball unless the request contains the Accept: application/x-tar header.

These are all excellent as well. I'll make these a new issue for v0.7.0 that refactors the internals.

scikit-hep / pyhf

Pass `Accept` header in `contrib.utils.download` #1491