oduwsdl / MementoEmbed

A service that provides archive-aware oEmbed-compatible embeddable surrogates (social cards, thumbnails, etc.) for archived web pages (mementos).
MIT License

<META> redirects should not redirect if <META> tag is within <noscript> #90

Closed: shawnmjones closed this issue 6 years ago

shawnmjones commented 6 years ago

Example URI: http://wayback.archive-it.org/all/20160209000335/https://twitter.com/TEN_GOP/status/689216708695994368

It contains a META tag redirect on line 26:

<noscript><meta http-equiv="refresh" content="0; URL=http://wayback.archive-it.org/all/20160209000335/https://mobile.twitter.com/i/nojs_router?path=%2FTEN_GOP%2Fstatus%2F689216708695994368"></noscript>

The page itself has content, and a browser follows the redirect only if it does not support JavaScript or has it disabled.

The target of the redirect http://wayback.archive-it.org/all/20160209000335/https://mobile.twitter.com/i/nojs_router?path=%2FTEN_GOP%2Fstatus%2F689216708695994368 was not archived.

Either MementoEmbed needs to ignore content within

shawnmjones commented 6 years ago

The problem can be fixed on these lines.

https://github.com/oduwsdl/MementoEmbed/blob/385fb497f52c6942e6912738bb7f2eff7e8b53ce/mementoembed/mementoresource.py#L87-L96

I'm not crazy about editing the DOM directly to expunge all <noscript> tags. I've had issues before where BeautifulSoup destroyed most of the document when I tried to remove one tag's worth of data.

Instead, we can check if the parent is noscript:

for tag in metatags:
    # Only honor <meta> tags whose direct parent is not <noscript>.
    if tag.parent.name != 'noscript':
        ...  # lines 98+ here

Of course, this will not handle weird cases where <noscript> is a more distant ancestor of <meta> rather than its direct parent.
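For the ancestor case, one option is to track whether the parser is currently inside any <noscript> element rather than looking only at the direct parent. The sketch below is not MementoEmbed's code: it swaps in the standard library's html.parser in place of BeautifulSoup so that it runs without dependencies, and the names MetaRefreshFinder and find_active_refreshes are made up for illustration. With BeautifulSoup, a comparable check would probably be `tag.find_parent('noscript') is None`.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collect meta refresh values that are not nested inside <noscript>.

    Hypothetical helper for illustration; not part of MementoEmbed.
    """

    def __init__(self):
        super().__init__()
        self.noscript_depth = 0      # > 0 while inside any <noscript> element
        self.refresh_contents = []   # raw content="..." values that were kept

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.noscript_depth += 1
        elif tag == "meta" and self.noscript_depth == 0:
            attributes = dict(attrs)
            # Attribute values can be None, so fall back to an empty string.
            if (attributes.get("http-equiv") or "").lower() == "refresh":
                self.refresh_contents.append(attributes.get("content"))

    def handle_endtag(self, tag):
        if tag == "noscript" and self.noscript_depth > 0:
            self.noscript_depth -= 1


def find_active_refreshes(html):
    """Return content values of meta refresh tags outside <noscript>."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.refresh_contents
```

Because the depth counter stays above zero for everything between the opening and closing <noscript> tags, a <meta> wrapped in extra elements inside <noscript> is skipped as well, while a top-level refresh is still reported.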

shawnmjones commented 6 years ago

Here is the social card that should have been produced by this tweet.

[Image: 2018-07-06_17-15-46]