mozilla / readability

A standalone version of the readability lib

Readability fails to keep body of content from major article #603

Open trobriander opened 3 years ago

trobriander commented 3 years ago

In the following article, it gets the list of signatories (which is a table) but not the actual statement they are signing (which is a series of <p> elements):

https://harpers.org/a-letter-on-justice-and-open-debate/

gijsk commented 3 years ago

Yeah, the problem is that we've had repeated issues with ditching tables (which can be quite important for the meanings of articles - as is arguably the case here), so we can't just ignore them. And the table with signatories, ironically, contains more text than the actual statement (both more words, when splitting on whitespace, and more characters). I also suspect that the way readability scores ancestors and combines scores from descendant nodes will not have done the main text any favours here.

trobriander commented 3 years ago

That's exactly what's happening. The <tr> gets a higher score than the <article> tag. In fact, the <article> is at the second index in the topCandidates array (sorted by score). Maybe this is a little too primitive, but a first solution might be to check whether the next element in topCandidates is an <article> and, if so, whether it's an ancestor of the topCandidate node. If it passes the test, it should become the new topCandidate.
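A rough sketch of that check, for illustration only. It uses plain objects instead of real DOM nodes, and the names `makeNode`, `isAncestor` and `promoteArticleAncestor`, as well as the scores, are made up for the example; none of this is Readability's actual code.

```javascript
// Hypothetical stand-in for a DOM node: tag name, parent pointer, score.
function makeNode(tagName, parentNode, score) {
  return { tagName, parentNode, score };
}

// True when `ancestor` appears somewhere on the parent chain of `node`.
function isAncestor(ancestor, node) {
  for (let p = node.parentNode; p; p = p.parentNode) {
    if (p === ancestor) return true;
  }
  return false;
}

// `candidates` is assumed to be sorted by descending score.
// If the runner-up is an <article> that contains the top candidate,
// prefer the <article>; otherwise keep the top candidate.
function promoteArticleAncestor(candidates) {
  const [top, next] = candidates;
  if (top && next && next.tagName === "ARTICLE" && isAncestor(next, top)) {
    return next;
  }
  return top;
}

// Illustrative tree: article > table > tr (scores are invented).
const article = makeNode("ARTICLE", null, 50);
const table = makeNode("TABLE", article, 0);
const row = makeNode("TR", table, 100);

console.log(promoteArticleAncestor([row, article]).tagName);
```

Note that this only looks at the immediate runner-up, so it would not help if another candidate sat between the top node and the <article>.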

gijsk commented 3 years ago

Perhaps just giving article nodes a scoring bonus if there's only 1 such tag in the document... What are the respective scores, if you've just debugged this?

trobriander commented 3 years ago

Apologies. The values were wrong due to manual DOM manipulation (for debugging). In fact, the article was the third item in the array, with the following scores:

<tr> has a score of 136
<div class="wysiwyg-content"> has a score of 79
<article> has a score of 74.5
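With these numbers, the single-<article> bonus suggested earlier in the thread would need to exceed 136 - 74.5 = 61.5 for the <article> to win. A minimal sketch of that arithmetic, assuming plain objects rather than DOM nodes; `applySingleArticleBonus` is a hypothetical helper, and the bonus of 62 is chosen only because it is just above the gap (the thread does not propose a value).

```javascript
// Hypothetical helper: if exactly one ARTICLE is among `nodes`, add `bonus`
// to its score, then return a new array sorted by descending score.
// The input array and its objects are left unmodified.
function applySingleArticleBonus(nodes, bonus) {
  const articleCount = nodes.filter((n) => n.tagName === "ARTICLE").length;
  const rescored = nodes.map((n) =>
    articleCount === 1 && n.tagName === "ARTICLE"
      ? { ...n, score: n.score + bonus }
      : { ...n }
  );
  return rescored.sort((a, b) => b.score - a.score);
}

// Scores as reported in this thread.
const candidates = [
  { tagName: "TR", score: 136 },
  { tagName: "DIV", score: 79 },
  { tagName: "ARTICLE", score: 74.5 },
];

// 62 is an assumed value, picked to be just above the 61.5 gap.
const ranked = applySingleArticleBonus(candidates, 62);
console.log(ranked.map((n) => `${n.tagName}:${n.score}`).join(" "));
// prints "ARTICLE:136.5 TR:136 DIV:79"
```

A fixed bonus this large is tuned to one page, so it mainly shows how far behind the <article> is here rather than suggesting a value to ship.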