Open maxpatiiuk opened 11 months ago
Frankly this whole thing is confusing to me, now that I'm looking at it.
Fundamentally, this is a bug in the website, which has double-escaped its values (it's escaped the curly quote's HTML representation ’
to ’
). The DOM engine (whatever you're using - web browser, jsdom
, something else) already unescapes what's there (calling getAttribute
will produce the text content in there, with the &
replaced with &
). If it kept unescaping until there were no &
bits left, that could potentially lead to bad/wrong output as well.
But by that token, manually calling _unescapeHtmlEntities
at all is confusing and wrong - or at least, not needed for attributes fetched through DOM methods.
Looks like this was discussed in https://github.com/mozilla/readability/pull/590#discussion_r404396493 .
Using DOMParser
isn't entirely trivial because this code can also run in node
so we'll need to stub it out in JSDOMParser.js
and try to access it through the document we get passed in as an argument.
There is a function in the code for unescaping HTML entities:
https://github.com/mozilla/readability/blob/2524fe371da2356b0bb79e0d34b028fa23388cd3/Readability.js#L1353-L1365
However, it does not capture all possible HTML entities (
’
)Example
On the page https://www.scientificamerican.com/podcast/episode/heres-why-actors-are-so-worried-about-ai/
There is this meta tag:
<meta property="og:title" content="Here&rsquo;s Why Actors Are So Worried about AI">
The page title is extracted from it.
Special html entities are supposed to be unescaped by this function, but they are not:
https://github.com/mozilla/readability/blob/2524fe371da2356b0bb79e0d34b028fa23388cd3/Readability.js#L1553
The metadata.title before calling
this._unescapeHtmlEntities
and after is the same:Here’s Why Actors Are So Worried about AI
Solution & Workaround
According to https://stackoverflow.com/a/34064434/8584605, a more effective (and still safe) way to unescape HTML entities would look like this:
Until this bug is fixed, I would be using the above function to post-process the title outputted by Readability.js