mozilla / readability

A standalone version of the readability lib
Other
8.68k stars 594 forks source link

Reader mode button doesn't appear on Project Zero blogspot post #503

Open evilpie opened 5 years ago

evilpie commented 5 years ago

The reader mode button doesn't appear in the urlbar for: https://googleprojectzero.blogspot.com/2018/12/on-vbscript.html

about:reader?url=https://googleprojectzero.blogspot.com/2018/12/on-vbscript.html however seems to mostly work, just the code snippets are a bit mangled.

gijsk commented 5 years ago

The issue is that this page uses a pile of <div>s with no <br>s and it therefore doesn't pass the simple detection algorithm in https://dxr.mozilla.org/mozilla-central/source/toolkit/components/reader/Readability-readerable.js .

We can't just add all <div>s to that because it would cause the detection speed to plummet.

This can easily be fixed by the site by switching to semantic markup ie using <p> for paragraph text.

EvsChen commented 5 years ago

I noticed in the website their first para has a textContent.length of 523. However, our criteria is >540 for "paragraphs". But it's something that should be "readerable".

gijsk commented 5 years ago

I noticed in the website their first para has a textContent.length of 523. However, our criteria is >540 for "paragraphs". But it's something that should be "readerable".

Sorry, what bit of code are you referring to? Off-hand I don't see 540 anywhere in our codebase. Is my diagnosis in https://github.com/mozilla/readability/issues/503#issuecomment-449048740 not correct? The reader mode button in Firefox is governed by the code in https://github.com/mozilla/readability/blob/master/Readability-readerable.js and I think we're just not picking things up at all because the text content is in <div>s instead of <p>s...

EvsChen commented 5 years ago

540 because we have score += Math.sqrt(textContent.length - 140). And we would return true for score > 20. So the minimum is 540.

The problem here is about <div> and <p>. But in most cases we cannot expect the site owner to change that. I'm just wondering if there is any way for us to regard this kind of <div> as <p>.

gijsk commented 5 years ago

540 because we have score += Math.sqrt(textContent.length - 140). And we would return true for score > 20. So the minimum is 540.

Right, but that can be reached by multiple paragraphs, right? The entire article as a whole should be big enough.

The problem here is about <div> and <p>. But in most cases we cannot expect the site owner to change that. I'm just wondering if there is any way for us to regard this kind of <div> as <p>.

Yeah... There's no classes on the <div>s themselves though, only on some of the containers. Perhaps, like for #420 , we need some per-domain rules to fix this.