Closed phonedude closed 6 years ago
Thanks for this.
MementoEmbed uses the requests library to find the values for the Link header. requests presents all HTTP headers as a case-insensitive dictionary. If a header is specified multiple times, requests is smart enough to combine the values together, so MementoEmbed does actually get all of the values for Link.
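The combining behaviour described above can be illustrated with the standard library alone. This is a minimal sketch, not MementoEmbed's or requests' actual code: it assumes that repeated header values are joined with a comma and a space, which is how requests (via urllib3) appears to expose them. The name combined_header is hypothetical, and the header bytes are a trimmed copy of the response quoted in this issue.

```python
import http.client
import io

# Trimmed copy of the response headers from this issue: two separate Link headers.
RAW_HEADERS = (
    b'Link: <https://wp.me/4cEB>; rel=shortlink\r\n'
    b'Memento-Datetime: Tue, 09 Oct 2018 21:25:48 GMT\r\n'
    b'Link: <https://ianmilligan.ca/>; rel="original"\r\n'
    b'\r\n'
)


def combined_header(raw, name):
    """Hypothetical helper: join every occurrence of a header into one value.

    Header lookup is case-insensitive, so "link" and "Link" are the same field.
    """
    message = http.client.parse_headers(io.BytesIO(raw))
    values = message.get_all(name, [])
    return ", ".join(values)


link_value = combined_header(RAW_HEADERS, "link")
print(link_value)
```

The point of the sketch is that the second Link header is not lost: both values end up in one comma-separated string, so any failure to find rel="original" has to come from how that string is parsed afterwards.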
The problem exists in the function convert_LinkTimeMap_to_dict seen on lines 108-109 above. This function expects all relations to be surrounded by quotes (e.g., rel="timegate" is parseable, but rel=timegate fails). Link header values like <https://wp.me/4cEB>; rel=shortlink are a product of WordPress. They do not surround the argument to rel in quotes. Memento entries in the Link header, on the other hand, do surround the argument to rel in quotes (e.g., rel="timegate"). The convert_LinkTimeMap_to_dict function is stumbling over that shortlink relation because it has no quotes, so it never gets to parse the rest of the string.
All examples in RFC 8288 - Web Linking use quotes, but section 3 states:
Note that any link-param can be generated with values using either the token or the quoted-string syntax; therefore, recipients MUST be able to parse both forms. In other words, the following parameters are equivalent:
x=y
x="y"
and
Previous definitions of the Link header did not equate the token and quoted-string forms explicitly; the title parameter was always quoted, and the hreflang parameter was always a token. Senders wishing to maximize interoperability will send them in those forms.
So, MementoEmbed needs to support both.
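To make the requirement concrete, here is a minimal sketch of a parser that accepts both the token and the quoted-string form. It is not the fix that went into convert_LinkTimeMap_to_dict; the names parse_link_header and find_by_rel are hypothetical, and the grammar is deliberately simplified (no escaped quotes inside quoted strings, no parameters without a value).

```python
import re

# One link-value: <URI> followed by zero or more ";name=value" parameters,
# where value is either a quoted-string or a bare token.
_PARAM = r'\s*;\s*[^;,=\s]+\s*=\s*(?:"[^"]*"|[^;,\s"]+)'
_LINK_RE = re.compile(r'<([^>]*)>((?:' + _PARAM + r')*)')
_PARAM_RE = re.compile(r';\s*([^;,=\s]+)\s*=\s*(?:"([^"]*)"|([^;,\s"]+))')


def parse_link_header(value):
    """Return a list of dicts, one per link, with "url" plus its parameters."""
    links = []
    for match in _LINK_RE.finditer(value):
        link = {"url": match.group(1)}
        for name, quoted, token in _PARAM_RE.findall(match.group(2)):
            # Exactly one of quoted/token is non-empty; the first occurrence
            # of a parameter wins.
            link.setdefault(name.lower(), quoted if quoted else token)
        links.append(link)
    return links


def find_by_rel(links, rel):
    """Return links whose rel contains the given relation type.

    rel may hold several space-separated relation types.
    """
    return [link for link in links if rel in link.get("rel", "").split()]


header = (
    '<https://wp.me/4cEB>; rel=shortlink, '
    '<https://ianmilligan.ca/>; rel="original", '
    '<https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; '
    'rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"; collection="memento"'
)
links = parse_link_header(header)
print(find_by_rel(links, "original"))
```

Matching the quoted-string alternative first matters here: the datetime value contains a comma, and treating every comma as a link separator would split that entry in two.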
I have discovered a possible solution. The requests library has its own link format parsing function. When I tested this function a few years ago, it failed miserably on parsing Memento headers, but a recent test last weekend indicates that it may have matured enough for us to use here.
It looks like the requests library implementation works.
I will have to test with some other URI-Ms.
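The comments above do not name the function, so the following is a sketch under the assumption that it is requests.utils.parse_header_links, which takes a Link header value and returns a list of dictionaries. The header string is a shortened version of the combined Link value from the sample URI-M in this issue.

```python
from requests.utils import parse_header_links

# Combined Link value: an unquoted WordPress shortlink followed by
# quoted Memento relations, one of which has a comma inside datetime.
header = (
    '<https://wp.me/4cEB>; rel=shortlink, '
    '<https://ianmilligan.ca/>; rel="original", '
    '<https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; '
    'rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"; collection="memento"'
)

links = parse_header_links(header)

# Index the parsed links by relation type for easy lookup.
by_rel = {link["rel"]: link for link in links if "rel" in link}
print(by_rel["original"]["url"])
```

If this holds for other URI-Ms as well, the unquoted shortlink entry no longer prevents the Memento relations that follow it from being parsed.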
Martin brought this to my attention. Here's a sample URL:
https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
and it returns two different Link headers:
$ curl -IL https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
HTTP/1.1 200 OK
Server: nginx/1.12.1
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Archive-Orig-Server: nginx
Date: Tue, 09 Oct 2018 21:25:48 GMT
X-Archive-Orig-Transfer-Encoding: chunked
X-Archive-Orig-Connection: keep-alive
X-Archive-Orig-Strict-Transport-Security: max-age=86400
X-Archive-Orig-Vary: Accept-Encoding
X-Archive-Orig-Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Link: <https://wp.me/4cEB>; rel=shortlink
X-Archive-Orig-Content-Encoding: gzip
X-ac: 3.sea _bur
Memento-Datetime: Tue, 09 Oct 2018 21:25:48 GMT
Link: <https://ianmilligan.ca/>; rel="original", <https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate", <https://scholarlyorphans.org/memento/timemap/link/https://ianmilligan.ca/>; rel="timemap"; type="application/link-format", <https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"; collection="memento"
Content-Location: https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Now, the first header is in error (it should be in X-Archive-Orig-Link), but multiple Link headers are allowed as per RFC 2616 (and 7230). Martin said MementoEmbed wasn't finding the link with rel="original", probably because it occurs in the second Link header and not the first.