How to check if page is readable?

luin / readability

📚 Turn any web page into a clean view


How to check if page is readable? #78

Open Feelnoobskill opened 8 years ago

Feelnoobskill commented 8 years ago

I want to check if the page is readable or not. Is that possible?

haroldtreen commented 8 years ago

What do you mean by this?

When you use readability, the article returned will be null if nothing was found. Otherwise it will return whatever part of the article it determined to be the main content.

I've been wondering how to flag articles that aren't being extracted properly (e.g. a description has been extracted rather than the article). My current approach is to compare the size of the extracted article HTML with the size of the input HTML. I've found that if the content HTML is < 3% of the input HTML, chances are the main article was missed.
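A minimal sketch of that size comparison, for illustration only. The function names are hypothetical and not part of node-readability, the 0.03 default is just the ~3% figure mentioned above, and the inputs are assumed to be plain HTML strings (the page source and the extracted content).

```javascript
// Hypothetical helper, not part of node-readability.
// Returns the extracted HTML's share of the input HTML, by character count.
function contentRatio(inputHtml, articleHtml) {
  // A null/empty extraction counts as a ratio of 0.
  if (!inputHtml || !articleHtml) return 0;
  return articleHtml.length / inputHtml.length;
}

// Flags an extraction as suspicious when the ratio falls below the threshold.
// Assumption: 0.03 mirrors the ~3% rule of thumb from this comment; tune it for your pages.
function looksLikeMissedArticle(inputHtml, articleHtml, threshold = 0.03) {
  return contentRatio(inputHtml, articleHtml) < threshold;
}
```

You would call `looksLikeMissedArticle(pageHtml, extractedHtml)` after running the extraction and treat a `true` result as "probably not the main article". Character count is a rough proxy; pages with heavy inline scripts or styles will skew the ratio downward.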

Does that help, @Feelnoobskill? Or maybe you can elaborate on what you picture the solution looking like?

Feelnoobskill commented 8 years ago

@haroldtreen thanks for the response. Basically, I would like to create a reader mode like the one iOS Safari has.

Meaning that some pages are not suitable for opening in reader mode (for example, the Stack Overflow home page). Right now node-readability will extract some random text from such a page, and this is not acceptable in my case. So I was thinking maybe someone has already faced this problem and can share their experience.

haroldtreen commented 8 years ago

Ah. Interesting. I wasn't aware that iOS did that.

Some ideas:

You could look at the page metadata to determine whether the page is an article, and use that to hide or force-show the reader button. For example:

<meta property="og:type" content="article">

You could run readability and check how much the HTML shrank. As stated before, I find that output at or below ~2.5-3% of the input size is a good sign that something went wrong.
Run readability and look at the length of the output. Reader mode will probably only be useful if the length of the output is > X characters/lines.
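The output-length idea could be sketched roughly like this. Everything here is an assumption for illustration: the helper names are hypothetical, the tag stripping is a crude regex rather than a real HTML parser, and the 500-character default is only a placeholder for the unspecified "X".

```javascript
// Hypothetical helper: approximates the visible text length of extracted HTML.
// Crude on purpose: strips anything that looks like a tag, then collapses whitespace.
function visibleTextLength(articleHtml) {
  if (!articleHtml) return 0;
  return articleHtml
    .replace(/<[^>]*>/g, " ") // replace tags with a space so words don't merge
    .replace(/\s+/g, " ")     // collapse runs of whitespace
    .trim().length;
}

// Assumption: minChars = 500 is an arbitrary placeholder for "X", not a recommended value.
function isLongEnough(articleHtml, minChars = 500) {
  return visibleTextLength(articleHtml) >= minChars;
}
```

Counting text rather than raw HTML keeps markup-heavy but text-poor output (navigation lists, link farms) from passing the check.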

The meta tag check is probably the closest you can get to a yes/no answer without actually running the extraction algorithm on the page.
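A rough sketch of the `og:type` check on raw HTML, assuming you have the page source as a string. The function name is hypothetical, and a regex scan is a shortcut: a real HTML parser would be more robust, and many article pages don't set this tag at all, so a `false` result is only a hint.

```javascript
// Hypothetical helper: returns true if the HTML contains
// <meta property="og:type" content="article"> (attributes in any order).
function hasArticleOgType(html) {
  // Collect every <meta ...> tag, then inspect each one's attributes.
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  return metaTags.some((tag) => {
    const isOgType = /\bproperty\s*=\s*["']og:type["']/i.test(tag);
    const isArticle = /\bcontent\s*=\s*["']article["']/i.test(tag);
    return isOgType && isArticle;
  });
}
```

Because it only reads the markup, this can run before any extraction, for example to decide whether to show a reader button at all.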

NinoSkopac commented 6 years ago

This is good info, @haroldtreen.

It would be great if the library had an API for this (e.g. isReadable).