mozilla / readability

A standalone version of the readability lib
Other
8.79k stars 598 forks source link

parentNode could be null #757

Open danielepolencic opened 2 years ago

danielepolencic commented 2 years ago

I can see the following error when parsing this page:

ReadabilityError: Evaluation failed: TypeError: Cannot read properties of null (reading 'replaceChild')
    at Readability._setNodeTag (__puppeteer_evaluation_script__:640:21)
    at Readability._grabArticle (__puppeteer_evaluation_script__:993:25)
    at Readability.parse (__puppeteer_evaluation_script__:2249:31)
    at __puppeteer_evaluation_script__:2286:49
    at __puppeteer_evaluation_script__:2288:7

Note: as you can see from the stack trace, this code runs in Puppeteer.

The error suggests that in this line parentNode might be null.

I'm not 100% sure why this happens (maybe the parent is a fragment or the document itself?) and what are the consequences, but I was wondering if we could introduce a check for that.

I'm happy to raise a PR to check if the parentNode is null. If it is, I think I can propose the following:

gijsk commented 2 years ago

Note: as you can see from the stack trace, this code runs in Puppeteer.

In what browser? What DOM library is being used? Have you done something to prevent page script from running at the same time (and e.g. responding to DOM removals/changes using events or mutation observers or similar mechanisms) ?

Basically, please can you provide more detailed steps to reproduce?

(maybe the parent is a fragment or the document itself?)

Neither of these should be possible - the callsite in _grabArticle is in a conditional that checks that node is a div, so it shouldn't be the root node, and the parser goes from the root through the rest of the doc, so shouldn't be able to end up in a fragment (as the fragment can't be a child of any element in the doc).

I'm happy to raise a PR to check if the parentNode is null. If it is, I think I can propose the following:

That would fix the symptom, but we should fix the cause instead. If we're iterating over the doc, it should not be possible to end up on a node without a parentNode - it indicates a bug elsewhere. We should find and fix that bug instead.

danielepolencic commented 2 years ago

In what browser?

Chrome

What DOM library is being used? Have you done something to prevent page script from running at the same time (and e.g. responding to DOM removals/changes using events or mutation observers or similar mechanisms) ?

We wait for the page to load and then evaluate the readability script with:

await page.goto(url, { waitUntil: 'networkidle0' })
const html = await page.content() // HTML before readability
const res = (await page.evaluate(`(() => {
      ${script}
      const article = new Readability(document).parse()
      return article ? { content: article.content, contentAsText: article.textContent } : {content: undefined, contentAsText: undefined}
    })()`))
// use res elsewhere

Where script is the content of @mozilla/readability/Readability.js. That code is wrapped into a try ... catch and the error that we see is the stack trace I included in the previous message.

Neither of these should be possible - the callsite in _grabArticle is in a conditional that checks...

Is it possible that the node has been removed from the doc already, but we still hold a reference to it? I'm wondering if the paywall might cause the issue here.

msujaws commented 2 years ago

While I agree in finding the root cause, checking for null here still seems like it should be done.