luin / readability

📚 Turn any web page into a clean view
2.48k stars 316 forks source link

Fix removing unlikely candidate #74

Closed dsuket closed 8 years ago

luin commented 8 years ago

Thank you for the pull request! Could you provide more details about this change?

dsuket commented 8 years ago

If the main content is wrapped in unlikelyCandidates class or id, that was deleted. For example in http://uxmilk.jp/43386, the main content is wrapped in #menu-wrap element. So, check the unlikelyCandidates with p elements. please tell me a better way how get the contents above site.

haroldtreen commented 8 years ago

@dsuket

This has also been a problem for me. But I still think the general idea behind removing unlikely candidates is good (if something just has a class of ad and nothing else, it's unlikely to have content).

My approach for fixing these problems has been to update the okMaybeItsACandidate regex. When the node matches this, it will not be removed.

I've done this a few times and those changes have been since pulled in. I've added an integration test to verify this fixes UX-Milk, and that seems to be passing.

So that page should be fixed and this change shouldn't be needed :).

Does that fix your issue @dsuket ? You can test with this commit to verify - https://github.com/luin/readability/pull/77

dsuket commented 8 years ago

awesome! thank you @haroldtreen !