mozilla / readability

A standalone version of the readability lib
Other
8.88k stars 603 forks source link

Not all HTML Entities are unescaped from title and other metadata (when double-escaped by websites) #820

Open maxpatiiuk opened 11 months ago

maxpatiiuk commented 11 months ago

There is a function in the code for unescaping HTML entities:

https://github.com/mozilla/readability/blob/2524fe371da2356b0bb79e0d34b028fa23388cd3/Readability.js#L1353-L1365

However, it does not capture all possible HTML entities (’)

Example

On the page https://www.scientificamerican.com/podcast/episode/heres-why-actors-are-so-worried-about-ai/

There is this meta tag: <meta property="og:title" content="Here&amp;rsquo;s Why Actors Are So Worried about AI">

The page title is extracted from it.

Special html entities are supposed to be unescaped by this function, but they are not:

https://github.com/mozilla/readability/blob/2524fe371da2356b0bb79e0d34b028fa23388cd3/Readability.js#L1553

The metadata.title before calling this._unescapeHtmlEntities and after is the same:

Here&rsquo;s Why Actors Are So Worried about AI

Solution & Workaround

According to https://stackoverflow.com/a/34064434/8584605, a more effective (and still safe) way to unescape HTML entities would look like this:

function htmlDecode(input) {
  const doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

Until this bug is fixed, I would be using the above function to post-process the title outputted by Readability.js

gijsk commented 10 months ago

Frankly this whole thing is confusing to me, now that I'm looking at it.

Fundamentally, this is a bug in the website, which has double-escaped its values (it's escaped the curly quote's HTML representation &rsquo; to &amp;rsquo;). The DOM engine (whatever you're using - web browser, jsdom, something else) already unescapes what's there (calling getAttribute will produce the text content in there, with the &amp; replaced with &). If it kept unescaping until there were no & bits left, that could potentially lead to bad/wrong output as well.

But by that token, manually calling _unescapeHtmlEntities at all is confusing and wrong - or at least, not needed for attributes fetched through DOM methods.

Looks like this was discussed in https://github.com/mozilla/readability/pull/590#discussion_r404396493 .

Using DOMParser isn't entirely trivial because this code can also run in node so we'll need to stub it out in JSDOMParser.js and try to access it through the document we get passed in as an argument.