mozilla / readability

A standalone version of the readability lib

Document how the algorithm works #9

Open DavidBruant opened 9 years ago

DavidBruant commented 9 years ago

Even if only very briefly in the readme

lofidevops commented 8 years ago

Perhaps include an example of a minimal semantic page structure that would get Readerized? (For people that want to explicitly target Reader.)

e.g. (assuming this were correct)

<html>
  <title>...</title>
  <body>
    <article>
      <h1>...</h1> <!-- included -->
      <p>...</p> <!-- included -->
    </article>
    <nav>...</nav> <!-- excluded -->
  </body>
</html>

gijsk commented 8 years ago

The problem is that it's not that simple... the algorithm is largely heuristics-based, relying on class names, IDs and the like. We can't rely on semantic page structures because most websites don't use them, and some actively abuse them: for example, using dozens or even hundreds of <article> elements to mark up links to articles on an overview page, or using <h2> just to get larger text for the 'lead' paragraph of an article (issue #281).
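
To make "heuristics based on class names and IDs" a bit more concrete, here is a minimal sketch of how such a weighting could work. It is not the library's code: the function name `class_weight`, the two word lists, and the ±25 weights are all assumptions made up for illustration.

```python
import re

# Hypothetical word lists, chosen for illustration only.
# They are NOT the patterns used by the library.
POSITIVE_HINTS = re.compile(
    r"article|body|content|entry|main|post|text|story", re.IGNORECASE
)
NEGATIVE_HINTS = re.compile(
    r"banner|comment|footer|promo|related|share|sidebar|sponsor|widget",
    re.IGNORECASE,
)


def class_weight(class_name: str = "", element_id: str = "") -> int:
    """Return a rough score for an element from its class and id strings.

    Each of the two strings is checked independently: a negative hint
    subtracts 25, a positive hint adds 25 (assumed weights).
    """
    weight = 0
    for value in (class_name, element_id):
        if not value:
            continue
        if NEGATIVE_HINTS.search(value):
            weight -= 25
        if POSITIVE_HINTS.search(value):
            weight += 25
    return weight


print(class_weight("article-body"))              # 25
print(class_weight("sidebar", "related-links"))  # -50
print(class_weight("post-comments"))             # 0
```

Note how `post-comments` nets out to zero because it matches both lists. Ambiguous names like that are one reason a name-based heuristic is hard to turn into a fixed "ideal template".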

lofidevops commented 8 years ago

Thanks for the info, I get that you're bravely parsing the real, messy web. I guess I'm wondering if the heuristics imply a family of ideal templates that could in theory be targeted, and that would answer questions like this one more concretely: https://stackoverflow.com/questions/30730300/optimize-website-to-show-reader-view-in-firefox (btw semantic stuff is just my preference)

I guess I'm concerned that if my page is Readerable today, but the heuristics evolve, it might be un-Readerable tomorrow. And my pages are not important in a global sense (compared to, say, Wikipedia or NY Times), so I wouldn't want to write a ticket just for them! It'd be easier to just target an "ideal" template (even if the ideal evolves).

Is there a concern that Readerability itself could/would be abused? (I can't imagine how/why, but I guess spammer-types have an evil ingenuity.)

dgellow commented 3 years ago

FYI, this article does a good job of describing how the library works at a high level: https://videoinu.com/blog/firefox-reader-view-heuristics/