mementoweb / rfc-extensions


Feedback on new implementation of Prefer header in pywb #7

Open ikreymer opened 6 years ago

ikreymer commented 6 years ago

I wanted to briefly mention a new implementation of Prefer header that is being added to pywb, as part of work for the UK Web Archive.

(The implementation is currently available on this branch in https://github.com/ukwa/pywb)

Here's a brief summary of this implementation.

TimeGate and URL-M both accept Prefer header

As a compromise between the previously suggested options, both the TimeGate and the URL-M support Prefer header in pywb and respond accordingly. The exact behavior is dependent on the memento negotiation pattern that is in use, as explained below.

Supported Preferences

The following preferences are supported:

Memento Negotiation Patterns

Since pywb actually supports multiple memento negotiation patterns defined in RFC 7089, it makes sense to have the Prefer header behavior also correspond to the negotiation pattern already in use.

Pattern 2.1 -- 302 Style Negotiation (spec)

When using 302 (*) style negotiation in pywb, the Prefer header results in a redirect to the 'canonical' url representing that format. The redirect happens when the Prefer header is present on either a URL-G or a URL-M request. The Preference-Applied header is served with the response.
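As a rough sketch of the redirect behavior (not pywb's actual code): the modifier strings, the `/coll/` prefix, and the function name below are hypothetical placeholders, since the real canonical urls are not reproduced here.

```python
# Hypothetical preference -> url modifier mapping. These are placeholders,
# not the modifiers pywb really uses.
CANONICAL_MODIFIERS = {
    "raw": "RAW_MOD",
    "banner-only": "BANNER_MOD",
    "rewritten": "REWRITTEN_MOD",
}


def redirect_for_preference(prefer, timestamp, url, prefix="/coll/"):
    """Return (status, headers) for a redirect-style response.

    Returns None when no supported preference was requested, in which
    case the normal URL-G / URL-M handling would apply.
    """
    pref = (prefer or "").strip().lower()
    if pref not in CANONICAL_MODIFIERS:
        return None
    location = f"{prefix}{timestamp}{CANONICAL_MODIFIERS[pref]}/{url}"
    # 307 rather than 302, as described in the notes of this write-up.
    return 307, {"Location": location, "Preference-Applied": pref}
```

The same function would serve both URL-G and URL-M requests, because the redirect target depends only on the preference, the timestamp, and the original url.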

Pattern 2.2 -- 200 Style Negotiation (spec)

When using 200 style negotiation, the Prefer header can also be applied on URL-G or URL-M, and the desired resource is served directly, with the correct Preference-Applied header. The Content-Location header is set with the canonical representation of the resource.

This mode is the default in pywb 2.0
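A minimal sketch of the 200 style response, under the assumption that an absent or unsupported preference falls back to the rewritten format. The `canonical_urls` mapping and the `render` callback are hypothetical stand-ins, not pywb interfaces.

```python
def respond_200_style(prefer, canonical_urls, render):
    """Serve the requested format directly instead of redirecting.

    canonical_urls: dict mapping format name -> canonical url (hypothetical).
    render: callable taking a format name and returning the response body.
    """
    requested = (prefer or "").strip().lower()
    applied = requested if requested in canonical_urls else None
    # Assumption: without a usable preference, serve the rewritten format.
    fmt = applied or "rewritten"
    headers = {"Content-Location": canonical_urls[fmt]}
    if applied:
        # Only echo a preference that was actually honored.
        headers["Preference-Applied"] = applied
    return 200, headers, render(fmt)
```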

Pattern 1.3 -- 200 Style Negotiation (spec)

Pattern 1.3 (**) is the proxy mode behavior, where the user connects to pywb via an HTTP/S proxy and no url rewriting is performed. The Prefer header is also supported in this mode, and the Preference-Applied header is returned in the response. Since URL-M = URL-G = URL-R in this mode, no redirect or alternative Content-Location is included. The Prefer header is especially useful here for requesting resources in different formats, since no unique canonical urls exist.

This mode only supports the raw and banner-only preferences. If Prefer: rewritten is requested, the response is actually the banner-only memento, i.e. Preference-Applied: banner-only
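The proxy mode fallback can be sketched as a small lookup table. The function and table names are made up for illustration; only the mapping itself comes from the description above.

```python
# What proxy mode actually applies for each requested preference.
PROXY_MODE_APPLIED = {
    "raw": "raw",
    "banner-only": "banner-only",
    # No url rewriting happens in proxy mode, so a request for
    # "rewritten" is answered with the banner-only memento.
    "rewritten": "banner-only",
}


def applied_preference_in_proxy_mode(prefer):
    """Return the value for Preference-Applied, or None if unsupported."""
    return PROXY_MODE_APPLIED.get((prefer or "").strip().lower())
```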

Canonical Url Representations

For non-proxy mode replay (Pattern 2.1 and 2.2), each preference corresponds to a 'canonical' url:

The canonical representation for rewritten also changes depending on whether pywb is running in framed or frameless replay:

Request for feedback

Let me know if there is any feedback on this implementation, or other thoughts. Some of this may be particular to the pywb implementation, but some of this behavior may make sense to standardize further or change.

If anyone is interested in code, here are a few unit tests that test for this behavior:

Notes

*: pywb actually uses 307 redirects instead of 302

**: pywb almost supports pattern 1.3 fully, but cannot include a link to rel=original in proxy mode, as it is not available. Nevertheless, the behavior is otherwise identical to pattern 1.3, but perhaps there should be another name for it?

hvdsomp commented 6 years ago

Thanks for taking on this work!

Here is feedback from the Memento team in Los Alamos:

=> The terms "rewritten", "raw", "banner-only" are risky in that other applications of Prefer could also use them, especially if they are not registered as per "The Registry of Preferences" of https://www.rfc-editor.org/rfc/rfc7240.txt. Apart from that, it would be nice to give the terms some kind of "branding" that refers to web archiving and Memento applications. For these purposes, we used the original- prefix for the terms we proposed in http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html. Doing so also provides a kind of extensibility mechanism, i.e. all terms with the same prefix relate to the same framework.

=> Note that http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html included terms to convey semantics that don't seem to be covered by the terms introduced here. Ultimately, we as a community should decide what makes sense and what doesn't with this regard.

=> There is the question of which negotiation is handled first: datetime or Prefer. The Memento RFC, which was published prior to the existence of Prefer, does state that datetime negotiation is handled prior to any other content negotiation, by which was meant prior to e.g. language, format, etc negotiation. Given the goal of Prefer in the current context, it seems that this rule should also apply to Prefer, i.e. datetime negotiation first, Prefer next. Even though Prefer isn't really considered negotiation ...
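To make the proposed ordering concrete, here is a small sketch in which the capture is chosen by datetime first and the preference is applied only afterwards. The `captures` structure and the function name are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def negotiate(captures, accept_datetime, prefer):
    """Pick a capture by datetime, then try to honor the preference.

    captures: dict mapping a timezone-aware capture datetime to a dict of
    format name -> body (a hypothetical in-memory index).
    Returns (capture datetime, applied format or None).
    """
    target = parsedate_to_datetime(accept_datetime)
    # Step 1: datetime negotiation picks the temporally closest capture.
    chosen = min(captures, key=lambda dt: abs(dt - target))
    # Step 2: the preference is evaluated against that capture only.
    applied = prefer if prefer in captures[chosen] else None
    return chosen, applied


july = datetime(2014, 7, 1, tzinfo=timezone.utc)
january = datetime(2015, 1, 1, tzinfo=timezone.utc)
index = {
    july: {"raw": b"...", "rewritten": b"..."},
    january: {"rewritten": b"..."},
}
print(negotiate(index, "Wed, 16 Jul 2014 20:02:43 GMT", "raw"))
```

With this ordering, a request dated closest to the January capture gets that capture even though it has no raw format, rather than being moved to an older capture that does.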

=> The pywb implementation of Pattern 1.3 is really problematic from the perspective of Memento clients. Clients decide that a resource is a Memento on the basis of the existence of a Memento-Datetime header and a link with rel="original", see [1] of http://mementoweb.org/guide/resourcetype/. Strictly speaking, a client could use Memento-Datetime only to make that determination. But, when doing so, the client does not know what the original (URI-R) is and hence cannot continue its time travel, e.g. to obtain another Memento of the same resource or to visit the original on the live web. The link with rel="original" is in that sense essential for Mementos (URI-M) and also for TimeGates (URI-G).
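The client-side check described here can be sketched as follows. This is a simplified illustration: the Link header parsing is a rough regular expression rather than a full parser, header names are matched case-sensitively, and the function names are made up.

```python
import re


def link_rels(link_header):
    """Collect rel values from a Link header (simplified parsing)."""
    rels = set()
    for match in re.finditer(r'rel="?([^";,]+)"?', link_header or ""):
        # A single rel attribute may carry several space-separated values.
        rels.update(match.group(1).split())
    return rels


def is_memento(headers):
    """True if the response has Memento-Datetime and a rel="original" link."""
    return (
        "Memento-Datetime" in headers
        and "original" in link_rels(headers.get("Link"))
    )


response_headers = {
    "Memento-Datetime": "Wed, 16 Jul 2014 20:02:43 GMT",
    "Link": '<http://example.com/>; rel="original"',
}
print(is_memento(response_headers))
```

A proxy mode response that carries Memento-Datetime but no rel="original" link fails this check, which is the problem raised above.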

anjackson commented 6 years ago

Although I appreciate the idea of having a prefix to the Prefer options, I'm having trouble understanding original-*. If you just want the raw response, as per the time of capture, how do you specify that? Do you need all three of original-content, original-links, original-headers?

If so, white-listing the types of re-write that should not be applied seems clumsy. If some new type of rewrite comes along (e.g. modifying video or embedded media tags to aid playback), how does that work? Do we now need to define an original-media-elements mode and start using it? Or am I misunderstanding it?

As with the banner-only option, it seems there may be a preference for whitelisting the re-writes you want rather than the other way around?

I guess rewritten is a way of saying 'all rewriting is fine', and I don't know how to express that using your original-* either, but maybe that's fine and that's what not Prefer-ing anything means?

EDIT to further explain my confusion, isn't original-content, original-links the same as original-content? It's not the original content if the links have been re-written, is it?

hvdsomp commented 6 years ago

I think two orthogonal aspects are being intertwined here:

  1. a suggestion to prefix the terms used for Prefer in Memento/archive related applications
  2. the semantics that are conveyed using Prefer in such applications

Regarding (1): I referred to the use of the original- prefix in the aforementioned blog post merely as an illustration of prefixing terms an sich, i.e. carving out a "namespace" of terms that pertain to the same framework/application. I did not suggest original- was the term to be used.

Regarding (2): The blog post indeed comes from the perspective of expressing various degrees of "rawness" and, as such, in the approach described, multiple terms can indeed be combined. It doesn't necessarily have to be that way, and I did not suggest it had to be that way in the above. That is why I indicated that "Ultimately, we as a community should decide what makes sense and what doesn't with this regard". That's also why we have asked for feedback from the community ever since the blog post was published.

In my opinion, the questions at this point are:

  1. Do we want to prefix the terms and, if so, which prefix should we pick?
  2. What are the semantics we want to convey using Prefer for memento/archive applications, i.e. what kind of Mementos do we want to be able to request? The three types of Mementos covered by Ilya's write-up make sense to me; yet see also below. Do they make sense to others?

A few more detailed considerations:

I hope we can get some reactions to this all.

anjackson commented 6 years ago

Ah, sorry @hvdsomp for not picking up that the blog post was meant to be illustrative rather than suggestive.

I have two concrete use-cases. The first is that, while attempting to integrate Mementos from multiple sources into a proxy service, I wanted to be able to request the un-rewritten entity, because none of that should be necessary in proxy mode. I had imagined this means the original headers as-is as well, but I now realise I'd not thought that through.

The second is more of a convenience, in that it would be nice to run a proxy service for users and for generating screenshots of archived resources, and in the latter case I'd want to switch off the banner. Of course, I could just set up a separate endpoint for that (in fact I'd probably do that anyway to separate out the load!), so I could live without it if it's problematic.

So, I seem to just be reiterating the main use cases covered here, but this selection of use cases does not seem broad enough to answer all the questions we have.

We could do with hearing from more users and use cases, I guess.

I have other cases that I would like to cover eventually, but they are very immature use cases that may or may not fit here. For example, as well as the standard WARC content, I also have (but have no way to give access to):

  1. The screenshot we took during the browser rendering of the original page, during the harvesting process.
  2. The thumbnail version of the screenshot we took in (1.)
  3. The screenshot we took in (1.) but with an image map overlaid so it's clickable.
  4. The HTML DOM from the browser at 'on-ready' that we stored when we rendered the web page.

We've also considered making available:

  1. A rendered screenshot of the archived version of the given resource.
  2. The re-written HTML, with particularly problematic elements (e.g. a Google Map panel) replaced with the relevant portion of the screenshot.

However, given I was kind of surprised by this statement: "When using a Memento client, no rewriting is needed for replay." I'm thinking I may have entirely missed the point.

ikreymer commented 6 years ago

Re: Namespace I agree that a namespace, such as memento- or webarchive- or wa- should be used to avoid confusion. original- is probably not a good choice.

Re: original-content, original-links, echo-original-headers I think this approach is problematic for several reasons, as is thinking of this as dimensions of rawness (also discussed in #1 and #2).

The effect of combining preferences means that there would be 6 different combinations that an implementation needs to support, and a client needs to understand, and that's not covering any other preferences. The use cases for each of the 6 combinations are unclear.

I would recommend avoiding combining preferences unless there is a clear use case for having these combinations.

Perhaps a better way to think about the Prefer is not as 'dimensions of rawness', but rather as format selection. A memento can be in Format A, or Format B, etc.. If a format is not available, a different format can be provided instead. The current implementation has taken this approach: A memento can be in the rewritten format, in banner-only format, and raw format.

There may be some formats that are extensible, like screenshot vs screenshot + clickable map, for example, but even then I'd hesitate to start combining preferences, unless absolutely needed.

Re: rewritten format This preference/format is defined so that there is a name given to the default format that is suitable for replay. A client may choose to use this format to save and serve to the user later, or perform some analysis on the modifications. Without going into implementation specific details, it may be hard to make this more specific, but perhaps one could add rewritten, pywb=2.0.2 or something like that to indicate the rewriting engine used. Again, there should be specific use cases for adding additional details.
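For illustration, a header value like the one suggested above could be split into its parts as follows. This is a simplified sketch of the comma-separated token[=value] syntax of RFC 7240: it drops any ;parameters and does not handle commas inside quoted strings. The pywb=2.0.2 token is only the idea floated here, not an existing preference.

```python
def parse_prefer(value):
    """Parse a Prefer header value into a dict of name -> value or None."""
    prefs = {}
    for item in value.split(","):
        # Drop any ;parameters and keep only the leading token[=value].
        token = item.split(";", 1)[0].strip()
        if not token:
            continue
        name, _, val = token.partition("=")
        prefs[name.strip().lower()] = val.strip().strip('"') or None
    return prefs


print(parse_prefer("rewritten, pywb=2.0.2"))
```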

Re: http headers Unfortunately, it is not possible to send unaltered HTTP headers as they may be interpreted by the server, especially hop-by-hop headers. A possible solution would be to add a Prefer: archival-record format, where the format is a full WARC record in the body of the response. This would be easy to support for most web archives, and would allow for cleanly sending original headers + original payload.

Re: HTTP proxy mode and memento and Prefer Regarding the almost-Pattern 1.3 use case, it should be noted that this is specifically a client connecting via HTTP/S proxy mode. HTTP/S proxy mode is an important way to access web archive content, used by the British Library, oldweb.today, and others.

Perhaps there should be a Memento specification for proxy mode, since Prefer and Accept-Datetime are arguably more useful when using proxy mode, since there is no other way to specify a format or a date.

A client using pywb in proxy mode could do something like this to receive a raw memento at specified date.

curl -x pywb:8080 -H "Prefer: raw" -H "Accept-Datetime: Wed, 16 Jul 2014 20:02:43 GMT" http://example.com/
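The same request could be sketched in Python with only the standard library. The proxy address pywb:8080 is taken from the curl example; the helper name is made up, and the actual `open()` call is left commented out because it needs a running pywb proxy.

```python
import urllib.request


def build_proxy_request(url, proxy, prefer, accept_datetime):
    """Build an opener that routes through the proxy, plus the request."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy})
    )
    request = urllib.request.Request(
        url,
        headers={"Prefer": prefer, "Accept-Datetime": accept_datetime},
    )
    return opener, request


opener, request = build_proxy_request(
    "http://example.com/",
    "http://pywb:8080",
    "raw",
    "Wed, 16 Jul 2014 20:02:43 GMT",
)
# response = opener.open(request)  # requires a reachable pywb proxy
# print(response.headers.get("Preference-Applied"))
```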

This usage pattern happens to be very close to Pattern 1.3 behavior, but isn't quite, and perhaps it should have a separate name. Of course the client knows that it's connecting via a proxy, so there should be no confusion there.

Re: negotiation order The datetime negotiation can be thought of as happening first (technically they happen at the same time). I think this would only be an issue if a certain preference existed for only certain mementos. Currently, there isn't an example of this, but it is something that should be considered.

ikreymer commented 6 years ago

Rereading the descriptions at: https://mementoweb.github.io/rfc-extensions/raw-memento/#rawness I can understand the intent of this to set up a kind of constraint system on independent dimensions of rewriting. In practice though, there aren't really independent dimensions, but only a few formats that make sense and have practical applications. I thought it may be useful to list these:

These correspond to the raw, banner-only, and rewritten modes

While there are roughly 'url rewriting' and 'content rewriting' settings, they are not independent dimensions, as url rewriting implies content rewriting. Headers need to be modified whenever any content rewriting happens.

One additional format to consider, I think this is something @ibnesayeed is interested in:

Other than that, I'm not sure there are any other options here, without delving into very implementation-specific details of rewriting systems.

Removing just the banner, for example, while keeping the rewriting system in place, may not be a desirable option to expose for security reasons (and a server can always set an empty banner if they desire).

Perhaps entirely client-side rewriting approaches (such as work being done by @ibnesayeed) will require some other type of hybrid format, or maybe that would be better handled by receiving a full WARC record? (the archival-record idea)

Possible Additions

Here's a summary of some possible additional preferences, based on the comments and thoughts so far.

(Names are just preliminary and with no determination on a possible prefix).

(From Andy's suggestions):