oduwsdl / MementoEmbed

A service that provides archive-aware oEmbed-compatible embeddable surrogates (social cards, thumbnails, etc.) for archived web pages (mementos).
MIT License

parsing multiple Link: response headers #136

Closed phonedude closed 6 years ago

phonedude commented 6 years ago

Martin brought this to my attention. Here's a sample URL:

https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/

and it returns two different Link headers:

$ curl -IL https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
HTTP/1.1 200 OK
Server: nginx/1.12.1
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Archive-Orig-Server: nginx
Date: Tue, 09 Oct 2018 21:25:48 GMT
X-Archive-Orig-Transfer-Encoding: chunked
X-Archive-Orig-Connection: keep-alive
X-Archive-Orig-Strict-Transport-Security: max-age=86400
X-Archive-Orig-Vary: Accept-Encoding
X-Archive-Orig-Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Link: <https://wp.me/4cEB>; rel=shortlink
X-Archive-Orig-Content-Encoding: gzip
X-ac: 3.sea _bur
Memento-Datetime: Tue, 09 Oct 2018 21:25:48 GMT
Link: <https://ianmilligan.ca/>; rel="original", <https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate", <https://scholarlyorphans.org/memento/timemap/link/https://ianmilligan.ca/>; rel="timemap"; type="application/link-format", <https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"; collection="memento"
Content-Location: https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'

Now, the first header is in error (it should be in X-Archive-Orig-Link), but multiple Link headers are allowed as per RFC 2616 (and 7230). Martin said MementoEmbed wasn't finding link rel="original", probably because it occurs in the second Link header and not the first.

shawnmjones commented 6 years ago

Thanks for this.

MementoEmbed uses the requests library to find the values for the Link header. requests presents all HTTP headers as a case-insensitive dictionary. If a header is specified multiple times, requests is smart enough to combine the values together, so MementoEmbed does actually get all of the values for Link.
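A minimal sketch of that combining behaviour, written with the standard library only. It is an illustration, not the code requests uses internally; the function name `combine_headers` is made up for this example, and the ", " separator is the joining rule RFC 7230 describes for repeated header fields.

```python
def combine_headers(pairs):
    """Fold repeated headers into one value per case-insensitive name.

    `pairs` is a list of (name, value) tuples in the order received.
    Hypothetical helper for illustration only.
    """
    combined = {}
    for name, value in pairs:
        key = name.lower()  # header names are case-insensitive
        if key in combined:
            # A repeated header is appended to the earlier value.
            combined[key] = combined[key] + ", " + value
        else:
            combined[key] = value
    return combined


received = [
    ("Link", "<https://wp.me/4cEB>; rel=shortlink"),
    ("Memento-Datetime", "Tue, 09 Oct 2018 21:25:48 GMT"),
    ("Link", '<https://ianmilligan.ca/>; rel="original"'),
]
headers = combine_headers(received)
print(headers["link"])
```

Seen this way, the parser receives one string with the WordPress shortlink entry first and the Memento entries after it.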

https://github.com/oduwsdl/MementoEmbed/blob/124675643fd680804b50ab268ebcd60c958dc6a1/mementoembed/mementoresource.py#L103-L115

The problem exists in the function convert_LinkTimeMap_to_dict seen on lines 108-109 above. This function expects all relations to be surrounded by quotes (e.g., rel="timegate" is parseable, but rel=timegate fails). Link header values like <https://wp.me/4cEB>; rel=shortlink are a product of WordPress. They do not surround the argument to rel in quotes. Memento entries in the Link header, on the other hand, do surround the argument to rel in quotes (e.g., rel="timegate"). The convert_LinkTimeMap_to_dict function is stumbling over that shortlink relation because it has no quotes and it never gets to parse the rest of the string.
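The failure mode can be reproduced with a small sketch. This is not the actual convert_LinkTimeMap_to_dict code; `STRICT_ENTRY` and `strict_parse` are hypothetical stand-ins for any parser that only accepts a quoted rel value and reads entries left to right.

```python
import re

# Hypothetical quote-only pattern: <target>; rel="..." followed by "," or end.
STRICT_ENTRY = re.compile(r'\s*<([^>]*)>\s*;\s*rel="([^"]*)"\s*(?:,|$)')


def strict_parse(value):
    """Parse entries left to right, giving up at the first one that does not match."""
    relations = {}
    pos = 0
    while pos < len(value):
        match = STRICT_ENTRY.match(value, pos)
        if match is None:
            raise ValueError("cannot parse entry at offset %d" % pos)
        relations[match.group(2)] = match.group(1)
        pos = match.end()
    return relations


memento_only = (
    '<https://ianmilligan.ca/>; rel="original", '
    '<https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate"'
)
with_shortlink = "<https://wp.me/4cEB>; rel=shortlink, " + memento_only

print(strict_parse(memento_only))  # both relations are found

try:
    strict_parse(with_shortlink)
except ValueError as err:
    # The unquoted shortlink entry comes first, so "original" is never reached.
    print("failed:", err)
```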

All examples in RFC 8288 - Web Linking use quotes, but section 3 states:

Note that any link-param can be generated with values using either the token or the quoted-string syntax; therefore, recipients MUST be able to parse both forms. In other words, the following parameters are equivalent:

x=y
x="y"

and

Previous definitions of the Link header did not equate the token and quoted-string forms explicitly; the title parameter was always quoted, and the hreflang parameter was always a token. Senders wishing to maximize interoperability will send them in those forms.

So, MementoEmbed needs to support both.
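One way to accept both forms is sketched below with the standard library only. It is an assumption about how such a parser could look, not what MementoEmbed ships; `parse_link_header` and `links_by_rel` are names invented for this example, and the regular expressions cover only the common cases (parameters of the form name=token or name="quoted string", entries separated by commas).

```python
import re

_NAME = r'[^\s=;,]+'
# A parameter value is either a quoted-string or a bare token.
_VALUE = r'(?:"(?:[^"\\]|\\.)*"|[^\s";,]+)'
_PARAM_RE = re.compile(r'\s*;\s*(' + _NAME + r')\s*(?:=\s*(' + _VALUE + r'))?')
_LINK_RE = re.compile(
    r'\s*<([^>]*)>((?:\s*;\s*' + _NAME + r'\s*(?:=\s*' + _VALUE + r')?)*)\s*(?:,|$)'
)


def parse_link_header(value):
    """Return a list of (target, params) tuples from a Link header value."""
    value = value.strip()
    links = []
    pos = 0
    while pos < len(value):
        match = _LINK_RE.match(value, pos)
        if match is None:
            raise ValueError("cannot parse entry at offset %d" % pos)
        params = {}
        for name, raw in _PARAM_RE.findall(match.group(2)):
            if raw.startswith('"'):
                # Drop the surrounding quotes and undo backslash escapes.
                raw = re.sub(r'\\(.)', r'\1', raw[1:-1])
            params.setdefault(name.lower(), raw)  # first occurrence wins
        links.append((match.group(1), params))
        pos = match.end()
    return links


def links_by_rel(links):
    """Index targets by relation type; rel may hold several space-separated types."""
    by_rel = {}
    for target, params in links:
        for rel in params.get("rel", "").split():
            by_rel.setdefault(rel.lower(), []).append(target)
    return by_rel


header = (
    "<https://wp.me/4cEB>; rel=shortlink, "
    '<https://ianmilligan.ca/>; rel="original", '
    "<https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; "
    'rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"; collection="memento"'
)
print(links_by_rel(parse_link_header(header)))
```

Because quoted strings are matched as a unit, the comma inside the datetime value does not split the memento entry.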

I have discovered a possible solution. The requests library has its own link format parsing function. When I tested this function a few years ago, it failed miserably at parsing Memento headers, but a test last weekend indicates that it may have matured enough for us to use here.
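The comment does not name the function. Assuming it is requests.utils.parse_header_links, which returns one dict per link with a "url" key plus one key per parameter, a rough check might look like this. The wrapper `rels_via_requests` is a hypothetical helper for this example.

```python
def rels_via_requests(link_value):
    """Map each rel value to its target URL using the parser bundled with requests."""
    # requests is a third-party package; imported here to keep the dependency explicit.
    from requests.utils import parse_header_links

    rels = {}
    for link in parse_header_links(link_value):
        # rel may carry several space-separated relation types.
        for rel in link.get("rel", "").split():
            rels[rel] = link["url"]
    return rels


HEADER = (
    "<https://wp.me/4cEB>; rel=shortlink, "
    '<https://ianmilligan.ca/>; rel="original", '
    '<https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate"'
)

if __name__ == "__main__":
    print(rels_via_requests(HEADER))
```

This only exercises the sample header from this issue, so other URI-Ms would still need to be checked.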

shawnmjones commented 6 years ago

It looks like the requests library implementation works.


I will have to test with some other URI-Ms.