mementoweb / rfc-extensions


Dimensions of Rawness seem too complicated #1

Open ikreymer opened 7 years ago

ikreymer commented 7 years ago

Having read the definitions of original-content, original-links and echo-original-headers, they seem rather more complicated than they need to be.

If the archive serves original-content, then by definition the links should also be original, and it should also imply echo-original-headers. It is unclear why there would be original headers without original content, or vice versa. The headers do need to be prefixed, as they could otherwise cause issues if they are interpreted by the HTTP server.
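To illustrate the prefixing point, here is a minimal sketch. The prefix `X-Archive-Orig-` and the function name `prefix_original_headers` are assumptions made for illustration, not something defined in the draft; the sample headers are made up as well.

```python
# Hypothetical prefix; the real one would be defined by the spec or the archive.
ORIG_PREFIX = "X-Archive-Orig-"


def prefix_original_headers(original_headers):
    """Rename each original header so the serving stack does not act on it."""
    return [(ORIG_PREFIX + name, value) for name, value in original_headers]


# Illustrative headers that would be harmful if echoed unprefixed.
original = [
    ("Content-Encoding", "gzip"),
    ("Transfer-Encoding", "chunked"),
    ("Set-Cookie", "session=abc"),
]

prefixed = prefix_original_headers(original)
for name, value in prefixed:
    print(f"{name}: {value}")
```

Without the prefix, a header such as `Transfer-Encoding: chunked` would describe the original response rather than the one actually being sent, which is the kind of conflict the prefix avoids.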

I suppose the example of original-links would be something like proxy mode, where there may be a custom insert but links are otherwise unchanged, but I can't think of many other use cases for wanting to retrieve this.

What is really needed is a standard method to return a 'raw' memento, perhaps by standardizing what the id_ field returns, or by adding another URL endpoint. The HTTP headers + payload could be served as a WARC record, eliminating the need to prefix the HTTP headers. For example, a dimension of rawness could be warc, which returns a standard WARC record in the HTTP response, containing original or generated WARC headers + original HTTP headers + original HTTP payload.

This should work for most use cases that require a raw memento, since it is probably used for further processing, and not to be displayed directly in a browser to a user.
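As a rough sketch of what such a response body could look like, the following assembles a single WARC/1.0 `response` record by hand using only the standard library. The function name `build_warc_response`, the sample URL, the capture date and the payload are placeholders for illustration; a real archive would more likely stream the stored record or use a WARC library, and would include further WARC headers.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response(target_uri, http_headers, http_payload, capture_date=None):
    """Assemble one WARC/1.0 'response' record as bytes.

    http_headers is the original status line plus headers (CRLF-separated,
    without the trailing blank line); http_payload is the original body.
    """
    # Record block: original HTTP headers, a blank line, then the payload.
    block = http_headers + b"\r\n\r\n" + http_payload
    date = capture_date or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # generated WARC header
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(block)}",
    ]
    head = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # A record is terminated by two CRLF pairs after the block.
    return head + block + b"\r\n\r\n"


# Placeholder values, for illustration only.
sample_headers = b"HTTP/1.1 200 OK\r\nContent-Type: text/html"
sample_payload = b"<html><body>Example</body></html>"
record = build_warc_response(
    "http://example.com/",
    sample_headers,
    sample_payload,
    capture_date="2017-01-01T00:00:00Z",
)
print(record.decode("utf-8"))
```

Because the original HTTP headers sit inside the record block, they never reach the serving HTTP stack as live headers, so no prefixing is needed.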

ikreymer commented 7 years ago

Looking at original-links again, if used without original-content, it may be too ambiguous to even be useful to return the version of a page served for proxy mode replay.

For example, given an original page:

<html>
<body>
<a href="http://example.com/">Example</a>
</body>
</html>

With original-links but not original-content, it seems like it would be possible to include injected content, as long as it doesn't contain any links:

The following would still be a valid response, under original-links:

<html>
<!-- Custom Archive Insert -->
<head>
<script>
...
</script>
</head>
<!-- End custom archive insert -->
<body>
<a href="http://example.com/">Example</a>
</body>
</html>

But the following would not, because the injected <script> now has a src attribute, which adds a new link.

<html>
<!-- Custom Archive Insert -->
<head>
<script src="/static/custom_rewrite.js"></script>
</head>
<!-- End custom archive insert -->
<body>
<a href="http://example.com/">Example</a>
</body>
</html>

Yet the two may be functionally identical: each is a page with a custom injected script.

If one were just to query all script tags, there is not necessarily a clear way to distinguish the injected script from the original contents of the page.
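The ambiguity can be sketched with a small scan of the three example pages. This assumes, purely for illustration, that a "link" means any `href` or `src` attribute value; the draft may define it differently. The class name `PageScan` and the helper `scan` are made up for this example.

```python
from html.parser import HTMLParser


class PageScan(HTMLParser):
    """Collect href/src attribute values and count <script> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def scan(html):
    parser = PageScan()
    parser.feed(html)
    return parser.links, parser.script_count


# The three pages from the examples above, condensed.
ORIGINAL = '<html><body><a href="http://example.com/">Example</a></body></html>'
INLINE = ('<html><head><script>...</script></head>'
          '<body><a href="http://example.com/">Example</a></body></html>')
EXTERNAL = ('<html><head><script src="/static/custom_rewrite.js"></script></head>'
            '<body><a href="http://example.com/">Example</a></body></html>')

orig_links, orig_scripts = scan(ORIGINAL)
inline_links, inline_scripts = scan(INLINE)
external_links, external_scripts = scan(EXTERNAL)

print(inline_links == orig_links)        # inline insert leaves the links unchanged
print(external_links == orig_links)      # src attribute adds a link
print(inline_scripts, external_scripts)  # script count is the same for both inserts
```

A link comparison accepts the inline insert and rejects the external one, while a script-tag query sees one script in each and cannot tell which, if any, came from the original page.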

The point is to illustrate that original-links without original-content seems potentially ambiguous, and the use case for it seems unclear.

A user can either get a 'raw' page with original content, which must include original links, or they get a 'rewritten' page and must now parse the page to determine what has been rewritten and what hasn't.