Open ikreymer opened 7 years ago
Looking at `original-links` again, if used without `original-content`, it may be too ambiguous to even be useful to return the version of a page served for proxy mode replay.
For example, given an original page:
<html>
<body>
<a href="http://example.com/">Example</a>
</body>
</html>
With `original-links` but not `original-content`, it seems like it would be possible to include injected content, as long as it doesn't contain any links.
The following would still be a valid response under `original-links`:
<html>
<!-- Custom Archive Insert -->
<head>
<script>
...
</script>
</head>
<!-- End custom archive insert -->
<body>
<a href="http://example.com/">Example</a>
</body>
</html>
But the following would not, because the injected `<script>` now has a `src` attribute, thus adding a new link:
<html>
<!-- Custom Archive Insert -->
<head>
<script src="/static/custom_rewrite.js"></script>
</head>
<!-- End custom archive insert -->
<body>
<a href="http://example.com/">Example</a>
</body>
</html>
Yet the two may be functionally identical: a page with a custom injected script.
If one were to just query all script tags, there is not necessarily a clear way to distinguish the injected script from the original contents of the page.
The point is to illustrate that `original-links` without `original-content` seems potentially ambiguous, and the use case for it seems unclear.
A user can either get a 'raw' page with original content, which must include original links, or they get a 'rewritten' page and must now parse the page to determine what has been rewritten and what hasn't.
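To make the ambiguity in the examples above concrete, here is a rough Python sketch. It assumes that "links" means the `href` and `src` attribute values in the page, which is my reading for illustration and not necessarily the exact definition of `original-links`. The names `LinkCollector`, `extract_links` and `links_unchanged` are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href/src attribute value found in a page (assumed meaning of 'links')."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def links_unchanged(original, served):
    """True if the served page exposes exactly the same links as the original."""
    return extract_links(original) == extract_links(served)


BODY = '<body><a href="http://example.com/">Example</a></body>'
ORIGINAL = "<html>" + BODY + "</html>"
# Archive insert with an inline script: no new link is introduced.
INLINE_INSERT = "<html><head><script>/* injected */</script></head>" + BODY + "</html>"
# Archive insert that loads a script by URL: one new link appears.
SRC_INSERT = (
    '<html><head><script src="/static/custom_rewrite.js"></script></head>'
    + BODY + "</html>"
)

print(links_unchanged(ORIGINAL, INLINE_INSERT))  # inline insert passes the link check
print(links_unchanged(ORIGINAL, SRC_INSERT))     # src insert fails the link check
```

Under this reading, the inline insert passes the check and the `src` insert fails it, even though both pages carry injected script. A client that receives either page still has no way to tell from the link list alone which `<script>` elements came from the archive.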
Reading the definitions of `original-content`, `original-links` and `echo-original-headers`, this seems rather more complicated than it needs to be.
If the archive serves `original-content`, then by definition links should also be original, and it should also imply `echo-original-headers`. It is unclear why there would be original headers without original content, or vice versa. The headers do need to be prefixed, as otherwise they could cause issues if they are interpreted by the HTTP server.
I suppose the example of `original-links` would be something like proxy mode, where there may be a custom insert but links are otherwise unchanged, but I can't think of many other use cases for wanting to retrieve this.
What is really needed is a standard method to return a 'raw' memento, perhaps standardizing what the `id_` field returns, or adding another URL endpoint. The HTTP headers + payload could be served as a WARC record, eliminating the need to prefix the HTTP headers. For example, a dimension of rawness could be `warc`, which returns a standard WARC record in the HTTP response, containing original or generated WARC headers + original HTTP headers + original HTTP payload.
This should work for most use cases that require a raw memento, since it is probably used for further processing, and not to be displayed directly in a browser to a user.
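As a rough illustration of the WARC idea, the sketch below serializes a single WARC/1.0 `response` record whose block is the original HTTP status line, headers and payload, with the WARC headers generated. The function name `build_warc_response` and the sample values are hypothetical; this is a minimal hand-rolled serializer for illustration, not a proposal for the exact response format, and a real implementation would more likely use an existing WARC library.

```python
import uuid


def build_warc_response(target_uri, status_line, headers, payload, warc_date):
    """Serialize one WARC/1.0 'response' record (sketch only, no digests or compression)."""
    # The record block is the original HTTP response, byte for byte:
    # status line, unprefixed headers, blank line, payload.
    http_head = status_line + "\r\n"
    http_head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    http_head += "\r\n"
    block = http_head.encode("iso-8859-1") + payload

    # Generated WARC headers; Content-Length covers the whole block.
    warc_head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {warc_date}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(block)}\r\n"
        "\r\n"
    )
    # A record ends with two CRLFs after the block.
    return warc_head.encode("utf-8") + block + b"\r\n\r\n"


# Hypothetical capture, for illustration only.
record = build_warc_response(
    target_uri="http://example.com/",
    status_line="HTTP/1.1 200 OK",
    headers=[("Content-Type", "text/html"), ("Set-Cookie", "a=b")],
    payload=b"<html></html>",
    warc_date="2017-01-01T00:00:00Z",
)
print(record.decode("utf-8"))
```

Because the original headers sit inside the record block, the outer HTTP response never interprets them, so something like `Set-Cookie` can be carried verbatim without a prefix.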