Open ikreymer opened 6 years ago
Thanks for taking on this work!
Here is feedback from the Memento team in Los Alamos:
=> The terms "rewritten", "raw", "banner-only" are risky in that other applications of Prefer could end up using them, especially if they are not registered as per "The Registry of Preferences" of https://www.rfc-editor.org/rfc/rfc7240.txt. Apart from that, it would be nice to give the terms some kind of "branding" that refers to web archiving and Memento applications. For these purposes, we had used the original- prefix for the terms we proposed in http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html. Doing so also provides a kind of extensibility mechanism, i.e. all terms with the same prefix relate to the same framework.
=> Note that http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html included terms to convey semantics that don't seem to be covered by the terms introduced here. Ultimately, we as a community should decide what makes sense and what doesn't with this regard.
=> There is the question of which negotiation is handled first: datetime or Prefer. The Memento RFC, which was published prior to the existence of Prefer, does state that datetime negotiation is handled prior to any other content negotiation, by which was meant prior to e.g. language, format, etc negotiation. Given the goal of Prefer in the current context, it seems that this rule should also apply to Prefer, i.e. datetime negotiation first, Prefer next. Even though Prefer isn't really considered negotiation ...
=> The pywb implementation of Pattern 1.3 is really problematic from the perspective of Memento clients. Clients decide that a resource is a Memento on the basis of the existence of a Memento-Datetime header and a link with rel="original", see [1] of http://mementoweb.org/guide/resourcetype/. Strictly speaking, a client could use Memento-Datetime only to make that determination. But, when doing so, the client does not know what the original (URI-R) is and hence cannot continue its time travel, e.g. to obtain another Memento of the same resource or to visit the original on the live web. The link with rel="original" is in that sense essential for Mementos (URI-M) and also for TimeGates (URI-G).
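To make the client-side check described above concrete, here is a minimal Python sketch. It assumes the response headers are available as a plain dict; the function names `link_relations` and `classify_memento` are hypothetical, and the Link parsing is deliberately simplified rather than a full RFC 8288 parser.

```python
import re


def link_relations(link_header):
    """Map each rel value in a Link header to the list of target URIs."""
    relations = {}
    # Split only on commas that start a new <uri>; commas inside quoted
    # datetime parameters are not followed by "<" and are left alone.
    for part in re.split(r',\s*(?=<)', link_header):
        target = re.match(r'\s*<([^>]*)>(.*)', part, re.S)
        if not target:
            continue
        rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', target.group(2))
        if not rel:
            continue
        # A single rel parameter may carry several space-separated values.
        for name in rel.group(1).split():
            relations.setdefault(name.lower(), []).append(target.group(1))
    return relations


def classify_memento(headers):
    """Return (has_memento_datetime, original_uri_or_None)."""
    lowered = {key.lower(): value for key, value in headers.items()}
    has_datetime = 'memento-datetime' in lowered
    originals = link_relations(lowered.get('link', '')).get('original', [])
    return has_datetime, (originals[0] if originals else None)
```

A response that carries Memento-Datetime but no rel="original" link yields `(True, None)`: the client can tell it holds a Memento but has no URI-R to continue from, which is the problem raised for proxy mode.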
Although I appreciate the idea of having a prefix to the Prefer options, I'm having trouble understanding original-*. If you just want the raw response, as per the time of capture, how do you specify that? Do you need all three of original-content, original-links, original-headers?
If so, white-listing the types of re-write that should not be applied seems clumsy. If some new type of rewrite comes along (e.g. modifying video or embedded media tags to aid playback) how does that work? Do we now need to define an original-media-elements mode and start using it? Or am I misunderstanding it?
As with the banner-only option, it seems there may be a preference for whitelisting the re-writes you want rather than the other way around?
I guess rewritten is a way of saying 'all rewriting is fine', and I don't know how to express that using your original-* either, but maybe that's fine and that's what not sending any Prefer header means?
EDIT to further explain my confusion: isn't original-content, original-links the same as original-content? It's not the original content if the links have been re-written, is it?
I think two orthogonal aspects are being intertwined here:
Regarding (1): I referred to the use of the original- prefix in the aforementioned blog post merely as an illustration of prefixing terms an sich, i.e. carving out a "namespace" of terms that pertain to the same framework/application. I did not suggest original- was the term to be used.
Regarding (2): The blog post indeed comes from the perspective of expressing various degrees of "rawness" and, as such, in the approach described, multiple terms can indeed be combined. It doesn't necessarily have to be that way, and I did not suggest it had to be that way in the above. That is why I indicated that "Ultimately, we as a community should decide what makes sense and what doesn't with this regard". That's also why we have asked for feedback from the community ever since the blog post was published.
In my opinion, the questions at this point are:
A few more detailed considerations:
rewritten states "fully rewritten content as needed for replay". When using a Memento client, no rewriting is needed for replay.
I hope we can get some reactions to all this.
Ah, sorry @hvdsomp for not picking up that the blog post was meant to be illustrative rather than suggestive.
I have two concrete use-cases. The first is that, while attempting to integrate Mementos from multiple sources into a proxy service, I wanted to be able to request the un-rewritten entity, because none of that should be necessary in proxy mode. I had imagined this means the original headers as-is as well, but I now realise I'd not thought that through.
The second is more of a convenience, in that it would be nice to run a proxy service for users and for generating screenshots of archived resources, and in the latter case I'd want to switch off the banner. Of course, I could just set up a separate endpoint for that (in fact I'd probably do that anyway to separate out the load!), so I could live without it if it's problematic.
So, I seem to just be reiterating the main use cases covered here, but this selection of use cases does not seem broad enough to answer all the questions we have.
We could do with hearing from more users and use cases, I guess.
I have other cases that I would like to cover eventually, but they are very immature use cases that may or may not fit here. For example, as well as the standard WARC content, I also have (but have no way to give access to):
We've also considered making available:
However, given I was kind of surprised by this statement: "When using a Memento client, no rewriting is needed for replay." I'm thinking I may have entirely missed the point.
Re: Namespace
I agree that a namespace, such as memento- or webarchive- or wa-, should be used to avoid confusion. original- is probably not a good choice.
Re: original-content, original-links, echo-original-headers
I think this approach, along with thinking of this as dimensions of rawness, is problematic for several reasons (also discussed in #1 and #2).
Combining preferences means that there would be 6 different combinations that an implementation needs to support and a client needs to understand, and that's not covering any other preferences. The use case for each of these 6 combinations is unclear.
I would recommend avoiding combining preferences unless there is a clear use case for having these combinations.
Perhaps a better way to think about the Prefer header is not as 'dimensions of rawness', but rather as format selection. A memento can be in Format A, or Format B, etc. If a format is not available, a different format can be provided instead. The current implementation has taken this approach: a memento can be in the rewritten format, in the banner-only format, or in the raw format.
There may be some formats that are extensible, like screenshot vs screenshot + clickable map, for example, but even then I'd hesitate to start combining preferences, unless absolutely needed.
Re: rewritten format
This preference/format is defined so that there is a name given to the default format that is suitable for replay. A client may choose to use this format to save and serve to the user later, or perform some analysis on the modifications. Without going into implementation-specific details, it may be hard to make this more specific, but perhaps we could add rewritten, pywb=2.0.2 or something like that to indicate the rewriting engine used. Again, there should be specific use cases for adding additional details.
Re: http headers
Unfortunately, it is not possible to send unaltered HTTP headers as they may be interpreted by the server, especially hop-by-hop headers. A possible solution would be to add a Prefer: archival-record format, where the format is a full WARC record in the body of the response. This would be easy to support for most web archives, and would allow for cleanly sending original headers + original payload.
Re: HTTP proxy mode and memento and Prefer
Regarding the almost-Pattern 1.3 use case, it should be noted that this is specifically a client connecting via HTTP/S proxy mode. HTTP/S proxy mode is an important way to access web archive content, used by the British Library, oldweb.today, and others.
Perhaps there should be a Memento specification for proxy mode, since Prefer and Accept-Datetime are arguably more useful when using proxy mode, as there is no other way to specify a format or a date.
A client using pywb in proxy mode could do something like this to receive a raw memento at a specified date:
curl -x pywb:8080 -H "Prefer: raw" -H "Accept-Datetime: Wed, 16 Jul 2014 20:02:43 GMT" http://example.com/
This usage pattern happens to be very close to Pattern 1.3 behavior, but isn't quite, and perhaps it should have a separate name. Of course the client knows that it's connecting via a proxy, so there should be no confusion there.
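For clients that are not shell scripts, the same proxy-mode request can be sketched with the Python standard library. This is only an illustration: the proxy address pywb:8080 is taken from the curl example, and the helper names `build_proxy_request` and `open_via_proxy` are made up for this sketch.

```python
import urllib.request


def build_proxy_request(url, preference, accept_datetime):
    """Build a request carrying the Prefer and Accept-Datetime headers."""
    return urllib.request.Request(url, headers={
        'Prefer': preference,
        'Accept-Datetime': accept_datetime,
    })


def open_via_proxy(request, proxy='http://pywb:8080'):
    """Send the request through an HTTP/S proxy (address is an assumption)."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': proxy, 'https': proxy}))
    return opener.open(request)


request = build_proxy_request(
    'http://example.com/', 'raw', 'Wed, 16 Jul 2014 20:02:43 GMT')
# response = open_via_proxy(request)
# print(response.headers.get('Preference-Applied'))
```

The last two lines are left commented out because they need a running proxy; the Preference-Applied response header would show which format was actually served.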
Re: negotiation order
The datetime negotiation can be thought of as happening first (technically they happen at the same time). I think this would only be an issue if a certain preference existed for only certain mementos. Currently, there isn't an example of this, but it is something that should be considered.
Rereading the descriptions at: https://mementoweb.github.io/rfc-extensions/raw-memento/#rawness I can understand the intent of this to set up a kind of constraint system on independent dimensions of rewriting. In practice though, there aren't really independent dimensions, but only a few formats that make sense and have practical applications. I thought it may be useful to list these:
No content rewriting (raw). Certain HTTP headers still need to be prefixed.
Content rewriting to insert an informational banner only. HTTP headers such as Content-Length, Content-Encoding (content needs to be decoded) need to be updated. Other headers prefixed as needed. This mode is useful for HTTP/S proxy mode replay access.
Content rewriting + url rewriting for replay + banner insertion. URLs are rewritten, as well as other content in the HTML, custom JS may be inserted for client side rewriting, and an informational banner added. Content-related headers are altered as in previous mode, but also links in Location: headers are updated. This mode is useful for standard rewriting replay access.
These correspond to the raw, banner-only, and rewritten modes.
While there are roughly 'url rewriting' and 'content rewriting' settings, they are not independent dimensions, as url rewriting implies content rewriting. Headers need to be modified whenever any content rewriting happens.
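One way to picture why these are formats rather than independent switches is the small Python sketch below. The class and attribute names are hypothetical; only the three mode names and the dependencies between the properties come from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayMode:
    banner: bool          # an informational banner is inserted
    url_rewriting: bool   # URLs are rewritten for replay

    @property
    def content_rewriting(self):
        # Inserting a banner or rewriting URLs both modify the content,
        # so this value is derived rather than set independently.
        return self.banner or self.url_rewriting

    @property
    def content_headers_updated(self):
        # Content-related headers change whenever the content changes.
        return self.content_rewriting


MODES = {
    'raw': ReplayMode(banner=False, url_rewriting=False),
    'banner-only': ReplayMode(banner=True, url_rewriting=False),
    'rewritten': ReplayMode(banner=True, url_rewriting=True),
}
```

Because `content_rewriting` is computed, there is no way to ask for URL rewriting without content rewriting, which matches the point that only a few combinations make sense.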
One additional format to consider, which I think is something @ibnesayeed is interested in:
Location header rewritten. This is a convenience that makes it easier to follow redirect mementos and is useful for a client-side aggregator. Perhaps it should be called raw-rewrite-redirects?
Other than that, I'm not sure there are any other options here, without delving into very implementation-specific details of rewriting systems.
Removing just the banner, for example, while keeping the rewriting system in place, may not be a desirable option to expose for security reasons (and a server can always set an empty banner if they desire).
Perhaps entirely client-side rewriting approaches (such as work being done by @ibnesayeed) will require some other type of hybrid format, or maybe that would be better handled by receiving a full WARC record? (the archival-record idea)
Here's a summary of some possible additional preferences, based on the comments and thoughts so far.
(Names are just preliminary and with no determination on a possible prefix).
(From Andy's suggestions):
screenshot - A full size screenshot, suitable for displaying at normal size
thumbnail - A thumbnail size screenshot, suitable for displaying a thumbnail
rendered-dom - A static DOM snapshot retrieved from document.outerHTML and probably with all
I wanted to briefly mention a new implementation of Prefer header that is being added to pywb, as part of work for the UK Web Archive.
(The implementation is currently available on this branch in https://github.com/ukwa/pywb)
Here's a brief summary of this implementation.
TimeGate and URL-M both accept Prefer header
As a compromise between the previously suggested options, both the TimeGate and the URL-M support the Prefer header in pywb and respond accordingly. The exact behavior is dependent on the memento negotiation pattern that is in use, as explained below.
Supported Preferences
The following preferences are supported:
Prefer: rewritten -- fully rewritten content as needed for replay. URLs may be rewritten throughout the content and other custom changes, such as a banner, may be injected into the memento.
Prefer: raw -- original, unaltered memento. No content is altered. Certain hop-by-hop headers may be prefixed with X-Archive-Orig-
Prefer: banner-only -- A banner is inserted into the <head> element in one continuous block, but the content is otherwise unaltered. No links are rewritten, and no other content is modified. Certain hop-by-hop headers may be prefixed with X-Archive-Orig- but headers are otherwise not rewritten. The banner can be easily detected with start and end markers. This preference is especially useful for proxy mode.
Memento Negotiation Patterns
Since pywb actually supports multiple memento negotiation patterns defined in RFC 7089, it makes sense to have the Prefer header behavior also correspond to the negotiation pattern already in use.
Pattern 2.1 -- 302 Style Negotiation (spec)
When using 302 (*) style negotiation in pywb, the Prefer header results in a redirect to the 'canonical' url representing that format. The redirect happens when the Prefer header is present on either a URL-G or URL-M request. The Preference-Applied header is served with the response.
Pattern 2.2 -- 200 Style Negotiation (spec)
When using 200 style negotiation, the Prefer header can also be applied on URL-G or URL-M, and the desired resource is served directly, with the correct Preference-Applied header. The Content-Location header is set with the canonical representation of the resource.
This mode is the default in pywb 2.0.
Pattern 1.3 -- 200 Style Negotiation (spec)
The Pattern 1.3 pattern (**) is the proxy mode behavior, where the user connects to pywb via an HTTP/S proxy and no url rewriting is performed. The Prefer header is also supported in this mode, and the Preference-Applied header is returned in the response. Since URL-M = URL-G = URL-R in this mode, no redirect or alternative Content-Location is included. The Prefer header is especially useful for requesting different format resources since no unique canonical urls exist.
This mode only supports the raw and banner-only preferences. If Prefer: rewritten is requested, the response is actually the banner-only memento, e.g. Preference-Applied: banner-only
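The proxy-mode fallback just described can be written down as a tiny Python sketch. The function name is hypothetical, and returning None for any other value is an assumption of this sketch, since the behavior for unknown or missing preferences is not described here.

```python
PROXY_MODE_PREFERENCES = ('raw', 'banner-only')


def applied_preference_in_proxy_mode(requested):
    """Return the value for Preference-Applied in proxy mode."""
    if requested in PROXY_MODE_PREFERENCES:
        return requested
    if requested == 'rewritten':
        # No url rewriting happens in proxy mode, so the closest
        # available format, banner-only, is served instead.
        return 'banner-only'
    # Assumption: anything else is outside what is described above.
    return None
```

A client should therefore read Preference-Applied rather than assume that the format it asked for is the one it received.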
Canonical Url Representations
For non-proxy mode replay (Pattern 2.1 and 2.2), each preference corresponds to a 'canonical' url, as follows:
raw - http://host/prefix/<timestamp>id_/<url>
banner-only - http://host/prefix/<timestamp>bn_/<url>
The canonical representation for rewritten also changes depending on whether pywb is running in framed or frameless replay:
In framed replay, rewritten is http://host/prefix/<timestamp>mp_/<url>
In frameless replay, rewritten is http://host/prefix/<timestamp>/<url>
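As a rough illustration of the url scheme above, the canonical url for a preference could be assembled as in the Python sketch below. The function name and the `framed` parameter are hypothetical, and the assignment of mp_ to framed replay and of the bare timestamp to frameless replay reflects my understanding of pywb, stated here as an assumption.

```python
# Modifiers placed directly after the timestamp for non-rewritten formats.
MODIFIERS = {'raw': 'id_', 'banner-only': 'bn_'}


def canonical_url(host_prefix, timestamp, url, preference, framed=True):
    """Build http://host/prefix/<timestamp><modifier>/<url>."""
    if preference == 'rewritten':
        # Assumption: mp_ in framed replay, no modifier in frameless replay.
        modifier = 'mp_' if framed else ''
    else:
        # Raises KeyError for preferences without a known modifier.
        modifier = MODIFIERS[preference]
    return '%s/%s%s/%s' % (host_prefix, timestamp, modifier, url)
```

For example, `canonical_url('http://host/prefix', '20140716200243', 'http://example.com/', 'raw')` gives `http://host/prefix/20140716200243id_/http://example.com/`.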
Request for feedback
Let me know if there is any feedback on this implementation, or other thoughts. Some of this may be particular to the pywb implementation, but some of this behavior may make sense to standardize further or change.
If anyone is interested in code, here are a few unit tests that test for this behavior:
Prefer header patterns 2.1 and 2.2
Proxy mode tests, including Prefer header patterns 1.3
Notes
*: pywb actually uses 307 redirects instead of 302
**: pywb almost supports pattern 1.3 fully, but cannot include a link with rel=original in proxy mode, as it is not available. Nevertheless, the behavior is otherwise identical to pattern 1.3, but perhaps there should be another name for it?