mozilla / readability

A standalone version of the readability lib
Other
8.38k stars 581 forks source link

Half way (?) on fixing readability for StackExchanges pages #539

Open RadhiFadlillah opened 5 years ago

RadhiFadlillah commented 5 years ago

Hi guys,

As title said, I'm working on fixing readability on StackExchanges pages. The last PR that try to fix this (#356) is dated back to 2017, and since then the Readability.js code has changed a lot. So, I decided to give it a try again.

I've made some code to fix it, but I'm not sure to put it as PR since there are still some problem to fix. Hence why I put it as issue.

The Problem

As explained in #232, the result of Readability.js in StackExchanges is spotty. Sometimes it works perfectly, sometimes we only get the question, and other times we only get the answers.

Perfect Only Question Only Answer

The Cause

I believe this problem happened because Readability.js failed to traverse parents of top candidate, to get the real article content. It's quite clear to see in the case when we only get the answers :

Readability.js failed to get the parent as real article content

This is happened because we only traverse the parent elements if and only if the top candidate doesn't have any siblings :

// If the top candidate is the only child, use parent instead. This will help sibling
// joining logic when adjacent content is actually located in parent's sibling node.
parentOfTopCandidate = topCandidate.parentNode;
while (parentOfTopCandidate.tagName != "BODY" && parentOfTopCandidate.children.length == 1) {
    topCandidate = parentOfTopCandidate;
    parentOfTopCandidate = topCandidate.parentNode;
}

While this is works in most websites, unfortunately it's not really suitable for StackExchanges, because almost every element in StackExchanges has siblings.

The Solution

To fix it, I decide to make Readability.js traverse the parent of top candidate if :

The code looks like this :

// If the top candidate is the only child, use parent instead. This will help sibling
// joining logic when adjacent content is actually located in parent's sibling node.
// However, in some sites, top candidate has multiple siblings. If so, we need to check
// whether the parent is cool enough to become the real candidate instead of the child.
parentOfTopCandidate = topCandidate.parentNode;
var parentScoreThreshold = topCandidate.readability.contentScore * 0.2;
while (parentOfTopCandidate.tagName != "BODY") {
    var parentIsCool = false;
    if (parentOfTopCandidate.readability) {
        parentIsCool = parentOfTopCandidate.readability.contentScore > parentScoreThreshold;
    }

    if (parentOfTopCandidate.children.length !== 1 && !parentIsCool) {
        break;
    }

    topCandidate = parentOfTopCandidate;
    parentOfTopCandidate = topCandidate.parentNode;

    this._initializeNode(topCandidate);
    parentScoreThreshold = topCandidate.readability.contentScore * 0.2;
}

I also made several changes to accompany these lines, which you can see in the full code in my repo.

The Good

From the change, there are several good results. First, as intended, StackExchanges works perfectly (at least in several pages that I test) :

Source Before After

This change also solve some other issues. Now hero image is included by Readability.js (#118, #166) :

Source Before After

The other fixed issue is #263 where Readability.js failed to parse when text splitted in several container :

Source Before After

The Bad

First, IMDB sites (#181) become more horrible thanks to the new code :

Source Before After

Second, all test is failed with flying color. This is because the test in Readabilty.js is done by comparing the DOM structure. Therefore, because our algorithm change, the DOM structure is changed as well, which make it failed.

To solve this, I decided to compare the old and new algorithm visually :

For example, here is the visual test result for ars-1 :

Before After Diff

In visual test, from 104 scenarios, 58 failed as in it changed from before visually. However, IMHO some of those that changed is become better than before (or at least doesn't disturb the radability of an article), so if you would, please check it by yourself. I've done the process which can be downloaded from Drive.

Conclusion

So, I guess that's how far I go for now. I'm not sure I should go in this direction, especially since most of the traditional test failed. So, what do you think ?

gijsk commented 5 years ago

Sorry it took a while to get back to this. I'm a bit puzzled. We already try to traverse up the parent tree, and score parents and grandparents of scored nodes using a "backoff" algorithm, ie https://github.com/mozilla/readability/blob/master/Readability.js#L916-L926 and the surrounding code.

In this sense, the threshold you've set is a bit strange, because it's comparatively very low. That code (per the comment) is meant to score grandparents with the sum of half the scores of their grandkids, so for any grandkid it will always outperform 0.2 * score.

The simplest way to fix pages (if that is possible) is to increase scoring of some of their semantic classes (in this case, probably either those of the container nodes themselves or those of the question/answer containers, so their ancestors get a corresponding score boost).

If you want to continue to pursue the changes you're currently considering, I'd like to understand better where the 0.2 factor comes from, and if this is functionally equivalent to just extending the code I referenced to score ancestors higher up (and if not, why not) - I think right now it's limited to parent + grandparent + great-grandparent. Ideally, I'd like to avoid having two sets of code that try to do very similar things (namely, picking ancestors higher up in the tree if all their children have comparatively high scores) in different ways, because their interaction will be hard to predict. I'd sooner extend the bit of code we already have.

Does that help?

gijsk commented 5 years ago

For IMDB, the way around failures tends to be to use the semantic classes of the content we don't want to "downscore" that content, in this case what looks like the sidebar of related content or whatever.

Oh, and the visual comparison stuff is very interesting. If there was a good way to integrate that I'd at least take patches for that separately, to be able to easily inspect failures visually. diff -w on beautified HTML output can sometimes help but you're right that the visual output is probably easier.