mozilla / readability

A standalone version of the readability lib

The verge: first sentence is skipped #847

Open KraXen72 opened 6 months ago

KraXen72 commented 6 months ago

https://www.theverge.com/2022/10/28/23428132/elon-musk-twitter-acquisition-problems-speech-moderation

The first sentence of this article is skipped. This is likely because the sentence is short and has a link (an a href) in it.

inhumantsar commented 5 months ago

took a look at this today and it is indeed because more than half of the sentence is link text. it fails the cleanConditionally check weight >= 25 && linkDensity > 0.5, where weight == 25 and linkDensity is just over 0.6. maybe related: it seems to be the only paragraph in the article that is still wrapped in a div after preprocessing.
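
To make the failing condition concrete, here is a minimal sketch of how a link-density check like the one described above can reject a short, link-heavy sentence. This is a simplified illustration, not Readability's actual code: the helper names linkDensity and failsLinkDensityCheck, the plain-string inputs, and the 40-character example sentence are all assumptions made for the example. Only the condition weight >= 25 && linkDensity > 0.5 comes from the comment above.

```javascript
// Simplified sketch, NOT Readability's real implementation.
// Link density here = characters inside links / total characters in the node.
// (Hypothetical helper: takes plain strings instead of DOM nodes.)
function linkDensity(totalText, linkTexts) {
  const total = totalText.trim().length;
  if (total === 0) return 0; // avoid dividing by zero for empty nodes
  const linked = linkTexts.reduce((sum, t) => sum + t.trim().length, 0);
  return linked / total;
}

// The condition quoted in the comment above; true means the node gets removed.
function failsLinkDensityCheck(weight, density) {
  return weight >= 25 && density > 0.5;
}

// Hypothetical sentence: 40 characters in total, 25 of them inside a link.
const sentence = "a".repeat(25) + "b".repeat(15);
const density = linkDensity(sentence, ["a".repeat(25)]);

console.log(density, failsLinkDensityCheck(25, density));
```

With these made-up numbers the density is 25 / 40 = 0.625, so a node with weight 25 trips the check, while the same node with a weight of 24 or a density of 0.5 or lower would be kept.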

i tinkered with cleanConditionally and getLinkDensity in a few different ways and was able to get a clean result from this article which included the missing sentence, but all of those changes had significant negative impacts across several other test cases.

there's a comment on this function which suggests taking the original content score into account, but it seems like that would require a fairly significant refactor and would likely come with its own set of negative knock-on effects. it's not something i'd want to jump into without input from the maintainers.

if someone more familiar with the project can suggest a direction to take this, i'd be happy to implement it.