jkunze / bagitspec


How should a resource in fetch.txt be content negotiated? #10

Open stain opened 9 years ago

stain commented 9 years ago

When resolving a URL in fetch.txt, you may get different results depending on content negotiation. You may therefore get a different representation back (e.g. HTML instead of JSON) depending on the browser and client settings used to retrieve the resource - obviously if you get the "wrong one" the bagit checksums will be wrong.

I think the specification should recognize this, and perhaps specify the default Accept headers to use, e.g.:

Accept-Language: *
Accept-Charset: *
Accept: application/octet-stream, */*;q=0.1

The headers Accept-Language and Accept-Charset may be excluded as their default is *.
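
As a rough illustration of the suggestion above, a client could attach these headers to every fetch.txt retrieval. This is only a sketch: `DEFAULT_FETCH_HEADERS` and `build_fetch_request` are hypothetical names, and the header values are simply the ones proposed in this comment, not anything the BagIt spec mandates.

```python
import urllib.request

# Assumption: these defaults are the headers suggested in this thread,
# not values defined by the BagIt specification.
DEFAULT_FETCH_HEADERS = {
    "Accept": "application/octet-stream, */*;q=0.1",
    "Accept-Language": "*",
    "Accept-Charset": "*",
}


def build_fetch_request(url, extra_headers=None):
    """Build a GET request for a fetch.txt URL with liberal Accept headers.

    Any per-entry headers passed in extra_headers override the defaults.
    """
    headers = dict(DEFAULT_FETCH_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual retrieval; the server is still free to ignore the headers.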

carlkesselman commented 9 years ago

An alternative would be to include an ETag for the particular rendering of the data object to which the checksum applies. Simply stating the accept type is not enough, because you could have different encodings of the octet stream. With this approach the line in the fetch.txt file would be

filename HTTP_URL ETAG

and a GET with the ETag and a strong validation requirement would ensure that you get the exact same encoding (and data) that the author intended. Of course this requires that we know the ETag of the rendering we want when we construct the reference, and that the server is properly implemented.

stain commented 9 years ago

Hmm.. not so sure about ETag here, you can't GET a particular Etag, just use it for conditional methods, with the If-Match header (or the inverse If-None-Match) - which would still require the Accept* headers to compare the Etag of the correct representation.

As for knowing you have the correct representation after retrieval, (depending on #8), you could just check the checksum in the manifest-*.txt for the remote resource.
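
A minimal sketch of that post-retrieval check, assuming manifest lines of the form `CHECKSUM FILEPATH` and that the algorithm is known from the manifest file name (e.g. manifest-sha256.txt). The helper names `parse_manifest` and `matches_manifest` are made up for illustration.

```python
import hashlib


def parse_manifest(text):
    """Map each payload path in a manifest to its expected hex digest."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The digest comes first; the rest of the line is the path,
        # which may itself contain spaces.
        digest, path = line.split(None, 1)
        expected[path.strip()] = digest.lower()
    return expected


def matches_manifest(path, payload, expected, algorithm="sha256"):
    """Return True if the fetched bytes hash to the manifest's digest."""
    digest = hashlib.new(algorithm, payload).hexdigest()
    return expected.get(path) == digest
```

If content negotiation returned an HTML page instead of the intended bytes, `matches_manifest` would simply return False; it cannot tell the client how to ask for the right representation.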

ETags are however more powerful than such checksums, as a server can choose to change them only when the content changes structurally or semantically, rather than on every byte-wise change - e.g. for a JSON representation that is constructed on the fly from an ElasticSearch instance.

There is nothing stopping a web-server from issuing a new ETag even when the content is byte-wise the same, e.g. because it is running on a newer version of Apache HTTP (indeed, some Apache installs used inode numbers and timestamps, so moving a file might change the ETag). Remember ETags are only intended for caching where the fallback is simply "just download it again".

This fits into #7 - are you meant to refresh a file from fetch.txt if it has changed on the server (new ETag) or has expired according to its cache headers?

carlkesselman commented 9 years ago

Hi,

Yes, you are right.

I think I’ve been thinking about this wrong. The fetch.txt is just a mapping from a local name to an actionable URI. For the bag to be complete and valid, there must be some set of steps by which that actionable URI can be converted into a set of bits that have the correct checksum. However, the exact nature of those steps is out of scope for the spec, as it will depend on the nature of the URI and the services implemented to provide the contents. The only question is whether having the URI (and, I assume, the desired checksum) is enough information, or whether there will be cases in which additional hints that cannot be encoded in the URI are needed.

Carl




stain commented 9 years ago

Yes, whenever a server provides a Vary header, then you cannot reliably use it in fetch.txt - unless there is a Content-Location header which you could use instead of the requested URL.
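
The rule of thumb above could be sketched as a small check on the response headers. This is an illustration only: `stable_fetch_url` is a hypothetical helper, and treating any non-empty Vary header as disqualifying is an assumption (a server that only varies on Accept-Encoding, for instance, may be harmless in practice).

```python
from urllib.parse import urljoin


def stable_fetch_url(requested_url, response_headers):
    """Return a URL that looks safe to record in fetch.txt, or None.

    If the response has no Vary header, the requested URL is returned.
    If it varies but names a Content-Location, that location is resolved
    against the requested URL and returned instead.
    """
    headers = {k.lower(): v for k, v in response_headers.items()}
    if not headers.get("vary", "").strip():
        return requested_url
    location = headers.get("content-location")
    if location:
        return urljoin(requested_url, location)
    return None
```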

Anything doing non-RESTful stuff like authentication-based resources (e.g. http://example.com/me that varies per user), Cookies, etc. are also out.

I guess HTTP redirects are OK, but could be a warning sign. Some content-negotiations on Linked Data result in a redirect based on your headers, after which you can download the final URL.

E.g. http://purl.uniprot.org/uniprot/M0R3D1 with Accept: text/turtle ultimately takes you to http://www.uniprot.org/uniprot/M0R3D1.ttl - clicking the first link in the browser shows a friendly HTML page instead which you probably don't want in your Bag.

Semantically the first is the protein identifier which is what we really want to aggregate, while the second is the representation that you want in fetch.txt - at least as we don't have a way to specify the Accept header.

Ardvaark commented 9 years ago

The fetch.txt was really more of a hack to facilitate the parallelized transfer of large bags using standardized tools, rather than something more exotic (Signiant, GridFTP, et al.) that we couldn't afford and/or comprehend. Created before the heady days of REST and content negotiation were really well-understood (by me, anyway), there was an implicit expectation of accepting whatever it was the server decided to send down, since it was being treated as an opaque blob of bits anyway. The liberal Accept headers you suggest above, @stain, seem to me the closest approximation to that intention.

But really, this might just be considered a feature and left as a problem for humans to work out. That the fetch.txt is for transfers also implies some sort of collaboration or work going on between the parties in the ephemeral now. If you GET an asset, and don't get the fixity you expect, you'll probably be emailing somebody on the other end to talk about it. Explicitly specifying the Accept headers doesn't actually do much to address the possibility of the server giving you back something you didn't expect.

stain commented 9 years ago

No, but a piece of software that is trying to complete the bag does not have the luxury of contacting the person responsible. It can, however, speak (say) the HTTP protocol, so I think we should have a minimum acknowledgement of this issue (perhaps written in a protocol-neutral way) so that different bagit software handle this in somewhat similar ways.

carlkesselman commented 9 years ago

I would agree with this. I've come to realize that in the HTTP space, there seems to really be nothing you can do generically to force a server to give you what you want. You resolve the name, and hope that the object is hosted by a service that offers bitwise perfection as part of its policy. Once you get the bits back, you do have the checksum, so you can validate that you got the bits or you didn't. As you point out, this can get complicated especially when we consider a variety of URIs such as DOIs or ARKs or just plain old URLs. We might want to consider a couple of non-normative examples in the spec?

robes commented 9 years ago

I too was hoping we could leverage the ETag field as @carlkesselman mentioned, but I think I may have to agree with @stain. Given the w3c specs as they are then, it seems that we can't expect persistence and stability as anything other than a 'matter of service'. Ultimately, from the perspective of the end-to-end principle, this doesn't change the fact that (URL, checksum) pairs or some other form of (location, identity) pairs are what allow the consumer to determine whether they can find and verify the desired representation of the desired resource.

But still, in the interest of reducing some avoidable errors caused by content negotiation, it seems desirable to allow the client to make a more specific request within the bounds of the protocol they are using. Would it be reasonable to suggest that the fetch.txt support:

filename  URL  [list of protocol-specific options]

Where those options could include Accept headers in the case of HTTP URLs, while allowing for other protocol-specific options?

stain commented 9 years ago

The current format is unfortunately not extensible in any way, as it is:

URL LENGTH FILENAME

with space-escaping etc required for the URL, but not for FILENAME. Thus a valid line currently could be

http://example.com/file.txt 512 data/folder with spaces/filename with spaces.txt

The only other possibility here (beyond minting magic URL schemes) is negative numbers below -1, e.g. -2 could mean "Should already exist, was downloaded from here", -3 could mean "Refresh from here" (as in #7), etc. This feels a lot like 1980s C programming, though.
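
To make the extensibility problem concrete, here is a sketch of parsing one line in the current `URL LENGTH FILENAME` layout. `FetchEntry` and `parse_fetch_line` are illustrative names, and treating `-` as "length unknown" is an assumption taken from this discussion rather than checked against the spec text.

```python
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when the length is given as "-"
    filename: str


def parse_fetch_line(line):
    """Split a fetch.txt line into URL, length and filename.

    Only the first two whitespace runs are separators; everything after
    them belongs to the filename, spaces included. That is why there is
    no room for a fourth field.
    """
    url, length, filename = line.rstrip("\r\n").split(None, 2)
    return FetchEntry(url, None if length == "-" else int(length), filename)
```

Because the filename swallows the remainder of the line, any extra trailing field would be indistinguishable from part of the filename.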

stain commented 6 years ago

Perhaps there can be optional indented lines below? These can just be RFC822-style headers that are protocol-specific for the client.

If size is unknown, then - can be used instead of number of bytes.

http://example.com/file1.txt 512 data/no-special-headers.txt
http://example.com/file2 8192 data/negotiated.en.html
  Accept: text/html, application/xhtml+xml;q=0.9
  Accept-Language: en
http://example.com/file2 1024 data/negotiated.jsonld
  Accept: application/json
ftp://example.org/file3.txt - data/unknown-size.txt
gsiftp://example.net/file4.bam 17179869184 data/quite-large-over-gridftp.bam?cc=1;tcpbs=10M;P=4
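
A sketch of how a client might read this proposed layout, assuming that any line starting with a space or tab is a `Name: value` header belonging to the entry above it. The function name and the dictionary shape are invented for illustration; nothing here is part of the current spec.

```python
def parse_extended_fetch(text):
    """Parse fetch.txt lines with optional indented header lines."""
    entries = []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw[0] in " \t":
            # Indented line: a protocol-specific header for the last entry.
            if not entries:
                raise ValueError("header line before any entry")
            name, _, value = raw.strip().partition(":")
            entries[-1]["headers"][name.strip()] = value.strip()
        else:
            url, length, filename = raw.split(None, 2)
            entries.append({
                "url": url,
                "length": None if length == "-" else int(length),
                "filename": filename,
                "headers": {},
            })
    return entries
```

Existing parsers that do not know about the indented lines would most likely fail on them, since they do not split into three fields with a numeric length, so the proposal is not silently backwards compatible.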

acdha commented 6 years ago

I'm somewhat mixed on whether fetch.txt should be part of bagit at all — it's fine for the simple cases but the additional possible features I've seen requests for start to quickly take on as much complexity as the entire BagIt spec.

Since BagIt allows arbitrary top-level tag files, in general I would pose the question of whether we should do anything other than plan to freeze/deprecate fetch.txt and encourage people to use something like Metalink (aka RFC 5854) with a well-known filename. That'd get out of the box a more complete spec and clients (e.g. curl) with support for things like mirroring without duplicating that work in the BagIt world.

In this specific case, I'm also a little skeptical about using content negotiation in this context. The web as a whole has been slowly moving away from it due to the complexity cost, and for that reason I'd expect bit-level fixity for negotiated content to be uncommon in our context.