mozilla / readability

A standalone version of the readability lib
Other
8.8k stars 598 forks source link

Problem handling invalid HTML attributes #859

Open ZachSaucier opened 5 months ago

ZachSaucier commented 5 months ago

I recently started using Readability.js for a reader view that I work on. One user reported an error with this website (note: it's paywalled so only a limited number of people can view it). The error shown in the console is:

Error handling response: Error: Failed to execute 'setAttribute' on 'Element': '@click' is not a valid attribute name.
    at Readability._simplifyNestedElements

This points to this part of Readability.js:

} else if (
  this._hasSingleTagInsideElement(node, "DIV") ||
  this._hasSingleTagInsideElement(node, "SECTION")
) {
  var child = node.children[0];
    for (var i = 0; i < node.attributes.length; i++) {
      child.setAttribute(
        node.attributes[i].name,
        node.attributes[i].value
      );
    }

I think @click as an attribute is a Vue.js thing.

Regardless, can Readability.js be updated to properly handle (probably just ignore) invalid HTML attributes instead of breaking? I can see validation of attribute names as a way forward or try/catching it and ignoring bits that error.

gijsk commented 5 months ago

The annoying thing is that we're just trying to collapse 2 element nodes together, so those "invalid" attributes are already present. But it turns out that what the HTML parser allows as attribute names (basically almost anything) is quite different from what setAttribute allows.

I asked around a bit and it seems the solution might have to be using setAttributeNode to move the actual existing attribute nodes across. That would require implementing that API in JSDOMParser and verifying that it works reasonably well in jsdom.

gijsk commented 3 months ago

The dupe had https://news.virginia.edu/content/dungeons-dragons-and-burgers-really-bad-outcomes-when-we-dont-grasp-fractions as a link that also has issues for the same reasons.