mozilla / readability

A standalone version of the readability lib
Other
8.9k stars 604 forks source link

Content from 1st div nested in another div is missing in Readability mode #704

Open mattcobb opened 3 years ago

mattcobb commented 3 years ago

On this nvida support page:

https://nvidia.custhelp.com/app/answers/detail/a_id/5149

The content under Steps for bullets 1 and 9 go missing in readability mode. I don't why these extra divs are created in the 1st place, but that's how it is. The pattern in abbreviated form is like this:

<div><div>1</div><div>2</div>...<div>8</div></div>
<div><div>9</div><div>10</div>...<div>14</div>...</div>

where the divs containing numbers 1 and 9 go missing - the 1st div nested in an outer div.

I tried this on mac firefox 90.0.2. Load the page, then click the Toggle Reader View

mattcobb commented 3 years ago

Did some debugging into this. The reason the nodes are removed is their weight is 0 and linkDensity is 0.29, which trips this condition in _removeNodes:

https://github.com/mozilla/readability/blob/28843b6de84447dd6cef04058fda336938e628dc/Readability.js#L2115

(!isList && weight < 25 && linkDensity > 0.2) ||

gijsk commented 3 years ago

Thanks for the report, and for taking time to debug this, that's super helpful!

Yeah, this looks like a case where there's very little non-link text in a container, so readability decides to omit it. The website doesn't use semantic markup for the list, or (as you can tell from the condition) we wouldn't necessarily omit it if the other items in the list were worthwhile. :-(

I'm not sure how to fix this without regressing other cases where we end up removing "contact us" links and so on that are on their own; the DOM from the website is very bare (no classes or anything else to go on) so I don't see a lot of options, unfortunately... open to suggestions though!

mattcobb commented 3 years ago

Looking into it further, the 1st div enclosed in the outer div is left as a div but the subsequent ones (2..7) are converted to a p tag so they get kept. Why isn't the 1st one also converted to a p?