Open DavidBruant opened 9 years ago
Perhaps include an example of a minimal semantic page structure that would get Readerized? (For people that want to explicitly target Reader.)
e.g. (assuming this were correct)
<html>
<title>...</title>
<body>
<article>
<h1>...</h1> <!-- included -->
<p>...</p> <!-- included -->
</article>
<nav>...</nav> <!-- excluded -->
</body>
</html>
The problem is that it's not that simple: the algorithm is largely heuristics-based, relying on class names, IDs and the like. We can't rely on semantic page structure because most websites don't use it, and some actively abuse it, e.g. using dozens or even hundreds of <article> elements to indicate links to articles on an overview page, or using <h2> to get larger text for the 'lead' paragraph of an article (issue #281).
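To make "heuristics-based on class names and IDs" a bit more concrete, here is a rough sketch of what such a scoring step could look like. It is not the library's actual code: the `classWeight` name, the short word lists, and the 25-point weights are all assumptions chosen for illustration, and the real patterns are longer and can change between releases.

```typescript
// Illustrative word lists only (assumed); the real patterns are longer and may change.
const POSITIVE = /article|content|main|post|story/i;
const NEGATIVE = /comment|footer|promo|sidebar|sponsor|widget/i;

// Hypothetical helper: score an element from its class and id attribute values.
// Each attribute can add or subtract an assumed weight of 25.
function classWeight(className: string, id: string): number {
  let weight = 0;
  for (const value of [className, id]) {
    if (value === "") continue; // an absent attribute contributes nothing
    if (NEGATIVE.test(value)) weight -= 25;
    if (POSITIVE.test(value)) weight += 25;
  }
  return weight;
}

console.log(classWeight("post-content", "main")); // 50
console.log(classWeight("sidebar widget", "")); // -25
console.log(classWeight("article-comments", "")); // 0
```

The last call shows why a fixed "ideal template" is hard to promise: a name like `article-comments` hits both lists and cancels out, so small naming choices can shift the outcome.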
Thanks for the info, I get that you're bravely parsing the real, messy web. I guess I'm wondering whether the heuristics imply a family of ideal templates that could, in theory, be targeted. That would also help answer questions like this one more concretely: https://stackoverflow.com/questions/30730300/optimize-website-to-show-reader-view-in-firefox (btw, the semantic stuff is just my preference).
I guess I'm concerned that if my page is Readerable today, but the heuristics evolve, it might be un-Readerable tomorrow. And my pages are not important in a global sense (compared to, say, Wikipedia or NY Times), so I wouldn't want to write a ticket just for them! It'd be easier to just target an "ideal" template (even if the ideal evolves).
Is there a concern that Readerability itself could/would be abused? (I can't imagine how/why, but I guess spammer-types have an evil ingenuity.)
FYI, this article does a good job of describing how the library works at a high level: https://videoinu.com/blog/firefox-reader-view-heuristics/
Even if only very briefly in the readme