readable-app / readability.rs

Really fast readability
17 stars 3 forks source link

The website with cookie banner doesn't work #3

Open jk-gan opened 1 year ago

jk-gan commented 1 year ago

The readable only able to capture the cookie banner text from this website. You can check at here. The website looks fine in the Firefox Reader View though.

mre commented 1 year ago

Nice catch. I don't know if it will be an easy fix, though. I'll move this to our extraction crate repo. @jfmontanaro fyi

jfmontanaro commented 1 year ago

Hi @jk-gan, thanks for bringing this up!

I think the underlying issue here is that the main article text is too short to fit the readability algorithm's idea of what should count as "main text". Since the text of the cookie policy is longer, that's what gets chosen instead. With a longer article, e.g. this one, the issue goes away.

One obvious step to take is to add cookie to the UNLIKELY_CANDIDATES regex, as anything with cookie in the id or class attribute is unlikely to be main article text. This does work to strip the cookie policy out of the result, but unfortunately in the case in question it just comes back blank. That is, it still isn't finding the article text.

As you note, Firefox's Reader Mode does include the article text in its result. Since readability.rs and Firefox Reader Mode are descendants of the same algorithm originally, it might be possible to tweak the criteria such that it works properly in this case. I do notice, however, that Firefox also includes the cookie policy in its Reader Mode view, which I would consider undesirable, so I don't think we want to copy Firefox exactly here.

I'll look into it, but this is a part of the algorithm that I haven't really messed with before so I can't say how simple or complex it might turn out to be.

jfmontanaro commented 1 year ago

Ok, I've been digging into this and I think I've discovered at least part of the root cause. It looks like the algorithm is correctly identifying the "main content" section, but is lumping the body text together with the "Other News" links, which results in it getting removed.

That is, the readability algorithm removes any elements that a) have a "class score" less than 25 and b) have a "link density" (number of text characters that are in links compared to total number of characters) greater than 0.2. Because, again, the text here is so short, the link density of the containing div ends up being about 0.45, which means it gets removed during the scoring step.

Ideally it would identify from the start that the "Other News" links are not part of the main text, and so it wouldn't pick an element that contains them as its basis. Unfortunately the <main id="main-content"> element scores higher than the <section> element, probably because of the id attribute, so it gets picked.

I'm not really sure of the best way to approach solving this. The only change I've tried that actually works is to get rid of the class_score < 25 && link_density > 0.2 condition for dumping an element, and just fall back on the link_density > 0.5 condition that's evaluated for everything. I'm worried this would reduce quality overall, though.

If anyone has alternative suggestions, I'm all ears.

mre commented 1 year ago

Perhaps instead of using 0.2 as a hardcoded cutoff for the link intensity it could use the link intensity of all elements for comparison. So the algorithm would become: calculate all link intensities of all elements and remove the elements with link intensity bigger than 80% of all other elements. Not sure if that makes sense, but maybe something you could play around with.