mozilla / readability

A standalone version of the readability lib

Improvement to the `isProbablyReaderable` heuristic for Japanese websites #429

Open — JoanEspasa opened this issue 6 years ago

JoanEspasa commented 6 years ago

I have a small sample set of URLs from some domains in Japanese where Readability does not show the button in the UI:

Those URLs visually show an article-like structure with a main body of text. Executing Readability unconditionally on them extracts the main content more or less correctly.

I've started looking into it, and I saw that most of those articles use a structure like this for their main content:

```html
<div class="foo">
  <p> foo bar foo bar </p>
  <p> bar foo foo bar </p>
  ...
</div>
```

`isProbablyReaderable` evaluates three different cases: `p` nodes, `pre` nodes, and `div` nodes that have a `br` child (which are treated like paragraphs).

Therefore, the `p` nodes holding the text content in those articles are already being considered. The problem is that most `p` nodes have rather little text content, but there are lots of them. When the function tests whether `textContentLength < 140` to discard unimportant `p` nodes, it discards most of them, so the score never reaches the threshold of 20 needed to make the article probablyReaderable ;)
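For reference, the relevant part of the check looks roughly like this (paraphrased from memory, not the exact source):

```js
// Paraphrased sketch of the per-node check in isProbablyReaderable:
var textContentLength = node.textContent.trim().length;
if (textContentLength < 140) {
  return false; // short nodes contribute nothing to the score
}
score += Math.sqrt(textContentLength - 140);
if (score > 20) {
  return true; // enough text found: the page is probably readerable
}
```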

The underlying issue, it seems to me, is that Japanese can express more ideas with fewer characters, but I don't fancy the idea of adding different if branches for each language...

Any ideas on how to make `isProbablyReaderable` produce fewer false negatives?

JoanEspasa commented 6 years ago

The first idea that came to mind is that the text content of those nodes must somehow be considered as a total, not separately. So maybe the `p`s should be "flattened" into their parent `div`. Something like: all `p` nodes that have a `div` parent add their textContentLength to it?
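As a rough, untested sketch of what I mean (`doc` being the document under test):

```js
// Accumulate each <p>'s text length onto its parent <div>, then score
// each div once with the combined length of its paragraphs.
var totals = new Map();
[].forEach.call(doc.querySelectorAll("div > p"), function(p) {
  var len = p.textContent.trim().length;
  totals.set(p.parentNode, (totals.get(p.parentNode) || 0) + len);
});
// Each entry in `totals` could then be tested against the 140/20
// thresholds instead of the individual <p> nodes.
```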

gijsk commented 6 years ago

Heh, funnily enough that (flattening things into parent nodes) is in a way what the 'real' readability parser does in some circumstances. (Certainly the parent gets part of the score from the kids, which is why the containers of "all the text" end up being selected.)

The main problem with `isProbablyReaderable` is speed: we don't want it to interfere with page loading and slow down all pageloads, and I'd be worried about that if we implemented some kind of flattening. We could try it, though. I'm not sure how we'd decide when to store things on the parent of the `p` and evaluate that instead of the individual nodes. Another idea would be to store the previously considered `p` in a local var, and when we get to the next `p`, if the previous one is its previousElementSibling, increment the score based on that (see the sketch below). Still, I'm not sure how much help that'd be...
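Roughly something like this, just to sketch the idea (the names are illustrative, and `nodes` would be the list of candidate nodes we already iterate over):

```js
// Illustrative sketch of the sibling idea: accumulate text length
// across runs of adjacent <p> nodes instead of scoring each alone.
var previousNode = null;
var runLength = 0;
[].forEach.call(nodes, function(node) {
  var len = node.textContent.trim().length;
  if (previousNode && node.previousElementSibling === previousNode) {
    runLength += len; // continuing a run of adjacent paragraphs
  } else {
    runLength = len; // a new run starts here
  }
  previousNode = node;
  // ... then test runLength against the 140/20 thresholds ...
});
```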

To me the fundamental problem is, as you say, that some languages "can express more ideas with fewer characters". One thing we can determine in constant time (and relatively low running time, I hope) is whether the html tag has a lang attribute, or whether there's a meta tag for og:locale; that covers all but one of your examples and would let us use different logic for e.g. Japanese or Chinese text. We could make the thresholds/scores variables whose values we decide once based on the language, after which the actual processing of the paragraphs should be just as fast as before. Does that sound reasonable?
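Something along these lines, perhaps (untested, and the CJK values below are placeholders we'd still have to determine):

```js
// Untested sketch: decide the thresholds once, based on the document's
// declared language, before the per-paragraph loop runs.
function getDocumentLanguage(doc) {
  var htmlLang = doc.documentElement.getAttribute("lang");
  if (htmlLang) {
    return htmlLang;
  }
  var meta = doc.querySelector('meta[property="og:locale"]');
  return meta ? meta.getAttribute("content") : null;
}

var lang = getDocumentLanguage(doc) || "";
var isCJK = /^(ja|zh|ko)(?:[-_]|$)/i.test(lang.trim());
// Placeholder values; the real numbers would have to be determined.
var minContentLength = isCJK ? 70 : 140;
var minScore = isCJK ? 10 : 20;
```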

JoanEspasa commented 6 years ago

Yes! I am aware of the speed requirements of the function; what I had in mind was a more "shallow" flattening. In fact, I like what you propose. At line 1650 there is a selection of div tags with brs inside; maybe the same could be done for these divs with many `p`s inside, so they can be considered specially later.

Checking the lang attribute on the html tag should be cheap and usable. The downside is that we may also need to store a list of language/threshold pairs (and determine them beforehand!).

Regarding the efficiency requirements: is there an established way you use to measure speed objectively? If modifications are to be made, I suppose we'd need it to rule out inefficient ideas and keep the speed within a reasonable margin.

gijsk commented 6 years ago

Sorry for the delay, this thread got a bit lost in my email.

For data on the words/character ratio, I think you could use Table 1 from http://iovs.arvojournals.org/article.aspx?articleid=2166061#90715216 . We already use data from this paper for Reader Mode in Firefox to display an estimate of how long an article takes to read; that code lives here: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/reader/ReaderMode.jsm#610-630 - you might be able to use it for inspiration.
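For instance, something like this (the numbers below are made up; the real characters-per-minute values would come from the paper's table):

```js
// Made-up numbers for illustration only; take the real values from
// Table 1 of the paper.
var CHARS_PER_MINUTE = {
  en: 1000,
  ja: 400,
};
// A language that packs the same reading time into fewer characters
// could get a proportionally lower length threshold:
var ratio = CHARS_PER_MINUTE.ja / CHARS_PER_MINUTE.en; // 0.4
var minContentLength = Math.round(140 * ratio);        // 56 instead of 140
```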

Funnily enough, I think I'd be more worried about how to normalize the contents of the language tags (probably need to deal with "ja", "ja-JP", and maybe "ja_JP" or whatever, plus whitespace and so on), but it should be doable...
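One possible normalization, as a sketch:

```js
// Lowercase, trim, treat "_" like "-", and keep only the primary subtag.
function normalizeLang(lang) {
  if (!lang) {
    return null;
  }
  return lang.trim().toLowerCase().replace(/_/g, "-").split("-")[0];
}

normalizeLang(" ja_JP "); // "ja"
normalizeLang("ja-JP");   // "ja"
normalizeLang("JA");      // "ja"
```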

As for the speed aspect: yes, there are benchmarks. You can run the perf-reference script with npm (see the readme as well). It's a little noisy on my machine, so you may want to alter it to run more iterations or something (it lives in the repo) so that it returns more stable numbers before making changes. Does that help? If not, feel free to ask more questions and I will do my best to provide answers. Thank you for looking into this!

JoanEspasa commented 6 years ago

These days I've been kind of busy, but a simple approach that worked wonders was to add `div > p` to this selector and then handle those div nodes as follows (replacing line 1680):

```js
var textContentLength = 0;
if (node.tagName === "DIV") {
  // Sum the text length of all direct <p> children, so the div is
  // scored on their combined content.
  [].forEach.call(node.children, function(child) {
    if (child.tagName === "P") {
      textContentLength += child.textContent.trim().length;
    }
  });
} else {
  textContentLength = node.textContent.trim().length;
}
```

BTW, I'm sure there is a more elegant way of doing the same.
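For instance, the summing could probably be written with reduce (untested):

```js
// Untested variant of the same logic using reduce:
var textContentLength;
if (node.tagName === "DIV") {
  textContentLength = [].reduce.call(node.children, function(total, child) {
    return child.tagName === "P"
      ? total + child.textContent.trim().length
      : total;
  }, 0);
} else {
  textContentLength = node.textContent.trim().length;
}
```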

That paper is very interesting; I hadn't considered differences between languages with that metric! Character ratios could be implemented on top of this to adjust at least the 140/20 constants accordingly. I'm sorry for not being able to devote more time to this. Cheers!