mozilla / readability

A standalone version of the readability lib

Changing class and id check to node name check to follow more semantic HTML #848

Closed panda01 closed 5 months ago

panda01 commented 6 months ago

I opened this PR to try to fix persistent issues I noticed when trying to fetch data from a website using Reader View. The main fix is changing the unlikelyCandidates regex check to test node.nodeName instead of the class and id names checked via matchString.

I'm not really sure why we check the class/id names at all, and this PR is partly here to ask that question while proposing a solution as well.
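For context, the existing check in the library looks roughly like the sketch below (a simplified sketch only: the real regexps are much longer and the surrounding traversal code is omitted). The PR swaps the class/id test for a nodeName test:

```javascript
// Simplified subsets of Readability's regexps (the real ones list many
// more substrings); the surrounding code is a sketch, not the library.
const unlikelyCandidates =
  /-ad-|banner|combx|comment|disqus|footer|menu|related|sidebar|social|sponsor/i;
const okMaybeItsACandidate = /and|article|body|column|content|main|shadow/i;

// Existing behavior: test the concatenated class + id string.
function isUnlikelyByClassId(node) {
  const matchString = node.className + " " + node.id;
  return (
    unlikelyCandidates.test(matchString) &&
    !okMaybeItsACandidate.test(matchString)
  );
}

// The change proposed in this PR: test the tag name instead.
function isUnlikelyByNodeName(node) {
  return (
    unlikelyCandidates.test(node.nodeName) &&
    !okMaybeItsACandidate.test(node.nodeName)
  );
}

// A <div class="social-share"> is stripped by the old check but kept by
// the new one, since "DIV" matches neither regexp.
const node = { nodeName: "DIV", className: "social-share", id: "" };
console.log(isUnlikelyByClassId(node));  // true
console.log(isUnlikelyByNodeName(node)); // false
```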

Some examples of websites that are broken and that this fixes (will add more): https://www.windsorstore.com/blogs/occasions/11-gorgeous-anniversary-outfit-ideas

One problem I noticed is that the tests all seem to complain about missing classes or IDs, but I'm having a hard time understanding some of the errors and spotting the differences. With time I'm sure I could figure it out, but maybe there is an automated way to get the diff between the actual and expected output.

I'm also willing to clean this up with either a squash-and-merge or a rebase.

I'm also open to any criticism! This thing works wonderfully 95% of the time and I don't want to break it!

gijsk commented 6 months ago

I'm a bit confused by this PR. Most of the substrings in unlikelyCandidates will never match node names, only classes or IDs. There are no HTML (or XUL or SVG) tags called disqus or social, and there probably never will be. The intent of these checks is to strip nodes that probably shouldn't be candidates, or if that doesn't leave enough content, to down-score them compared to other nodes.

The net effect of your patch is likely roughly the same as eliminating these regular expressions and associated code entirely (or at least reducing the regular expressions to only article|body|main for okMaybeItsACandidate and footer|header|menu for unlikelyCandidates).
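To make that point concrete, here is a quick sketch (again using a simplified subset of the real unlikelyCandidates regexp) showing that when the pattern is tested against tag names rather than class/id strings, only a few substrings can ever fire:

```javascript
// A simplified subset of unlikelyCandidates; the real regexp is longer,
// but the extra substrings (disqus, sidebar, social, ...) are never
// valid HTML tag names either.
const unlikelyCandidates = /disqus|footer|header|menu|sidebar|social|sponsor/i;

const htmlTags = ["ARTICLE", "DIV", "FOOTER", "HEADER", "MENU", "P", "SECTION"];
const tagsThatMatch = htmlTags.filter((t) => unlikelyCandidates.test(t));
console.log(tagsThatMatch); // ["FOOTER", "HEADER", "MENU"]
```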

It would be useful to have more examples of the websites you're trying to fix, and a bit more detail as to what nodes are getting stripped / downscored that you think shouldn't be. Right now I'm not sure what to suggest instead.

panda01 commented 6 months ago

Thank you for getting back to me in such a timely fashion! I appreciate it, and I apologize for my tardiness; I was trying to gather more examples. Here are a couple of them, and I'm trying to come up with more.

This URL misses the first few paragraphs, seemingly because they have "section" in their class name and none of the okMaybeItsACandidate classes: https://www.troweprice.com/personal-investing/resources/insights/retirement-savings-by-age-what-to-do-with-your-portfolio.html

This one captures some of the hidden data but not the FAQs, because of the hidden class on the FAQ accordion and the presence of aria-hidden="true". I would actually prefer that it captured neither, or both, but also that the decision not hinge on the class containing something like "hidden": https://www.ondemand.labcorp.com/lab-tests/complete-blood-count
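The aria-hidden part of this behavior comes from Readability's visibility check, which looks roughly like the sketch below (a simplified sketch of the library's isProbablyVisible helper; the real version also keeps nodes whose class includes "fallback-image"):

```javascript
// Simplified sketch: a node is treated as invisible when it is
// display:none, carries the hidden attribute, or has aria-hidden="true".
function isProbablyVisible(node) {
  return (
    (!node.style || node.style.display !== "none") &&
    !node.hasAttribute("hidden") &&
    (!node.hasAttribute("aria-hidden") ||
      node.getAttribute("aria-hidden") !== "true")
  );
}

// Minimal mock of a collapsed FAQ accordion panel with aria-hidden="true".
const collapsedFaq = {
  style: { display: "" },
  attrs: { "aria-hidden": "true" },
  hasAttribute(name) { return name in this.attrs; },
  getAttribute(name) { return this.attrs[name]; },
};
console.log(isProbablyVisible(collapsedFaq)); // false
```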

On my local instance, with the char threshold lowered to 300, this page still collects the list of items below it: https://www.krefel.be/nl/c/airfryers

panda01 commented 6 months ago

I also want to say that I very much agree with you about the list entries not really being element names; I thought maybe they were possible tag names or something. And even though I understand that if there are no top candidates it will do a second loop, a lot of my issues stem from there being some kind of top candidate that does work out, so it never hits that second loop.

I'm not really sure what the solution is; this is just something I tried that seemed to work pretty well, even with the tests. I'm open to working out some kind of solution. I would just imagine that semantic tags like <article /> and <section /> should mean something to the capturing of content on the page, and under some scenarios they don't seem to.

panda01 commented 5 months ago

Firstly, @gijsk, thank you very much! I appreciate all of your feedback! I will close this PR, though, as it is based off of the main branch and I simply didn't realize it. If you want any of the changes, I can create another branch and cherry-pick some of the code changes.

To answer your question above, though: the extra console logs did indeed help me understand how this works, spot some of the issues, and find out exactly why certain pages weren't being captured properly, which has been a task of mine recently for sure!

Just let me know if there are any changes you want, I guess via review comments or something, and we can figure out how to get it into that marvelous repo y'all are maintaining over there. It would be my honor!