mozilla / readability

A standalone version of the readability lib
Other
8.8k stars 598 forks source link

Numerical Commas Fix #853

Open panda01 opened 6 months ago

panda01 commented 6 months ago

Added a fix to not count commas inside of things with prices like 5,99 etc to give a more accurate content score.

One example of this causing problems is on this page https://www.krefel.be/nl/c/airfryers. However I think you also have to lower the char threshold for it to work

cmkm commented 6 months ago

It'd be helpful to get a little more detail here. Is there a specific part of this page (such as the description) you'd like to be included in the results? What's the character threshold value to see the expected results?

panda01 commented 6 months ago

@cmkm Sorry had a hectic past few days.

When Readability tries to grab the content it only grabs the product list and the descriptions below it. In my investigations I found that the comma detection was giving the product list extra points (like 55 was the score) because of the commas it was detecting in the prices like 109,00.

I just am assuming here that detecting commas in prices, numbers, etc was not the original purpose of that code, and we really don't care about those anyway if we're looking for readable content. So i just changed the Regex to only look for commas surrounded by non digits. Which def helped with the extraction for the specific page above, and gave me that first paragraph and all of the content below. Which is def closer I would say to what a user might expect when using readability on that page

cmkm commented 5 months ago

Thank you! This seems reasonable, but before we merge it, would you mind please adding a test case by using the instructions here? If you have an example of a page with slightly more readerable content, that would be ideal. Thanks! :)

panda01 commented 5 months ago

@cmkm Awesome, I definitely would love to contribute! However I did try and run that script to no avail, and even went so far as adding some more console.log's on it to see where it's getting hung up.

Below is a screenshot of my modifications and the output when I try and run the command.

Screenshot 2024-04-11 at 3 19 47 PM

I'm not sure what I'm doing wrong here, so any guidance is greatly appreciated!

cmkm commented 5 months ago

Thanks for your patience here! I think what you're running into is the same thing that I was, which is that this website may be blocking certain user agents, so your requests are timing out. Is there another website test case that your patch fixes that you might be able to try instead?

panda01 commented 4 months ago

Busy couple of weeks there! I can def try and find some new source to check this. Just was thinking what happened with this PR

gijsk commented 3 months ago

So I tried to generate a testcase for this locally to test this. The problem is that it seems to do the opposite of what the comments suggest.

Specifically, before this patch, the "expected" HTML includes the "Wil je graag een airfryer kopen?" text. After the patch, that paragraph is removed.

panda01 commented 3 months ago

So I tried to generate a testcase for this locally to test this. The problem is that it seems to do the opposite of what the comments suggest.

Specifically, before this patch, the "expected" HTML includes the "Wil je graag een airfryer kopen?" text. After the patch, that paragraph is removed.

@gijsk This is an image of the website on my ff v127.0 64bit apple intel. I don't see the above mentioned section "Wil je graag een airfryer kopen?"

Screenshot 2024-07-01 at 9 31 54 AM

Also, with this particular website, I believe you also have to lower the character threshold in order for this to work. Either way, the product list shouldn't be counted so high in the algorithm.

The reason why it is counted so high is because of the commas being counted in the prices, which is why this PR works.

panda01 commented 4 weeks ago

Hello, I would really like to contribute to this repo. And apparently this is still relevant, however this seems stuck. Is there something I'm missing here as to why this can't be merged? It's a very tiny PR, and I'm just having a hard time understanding what the hold up is, and what I can do differently.

gijsk commented 5 days ago

Hello, I would really like to contribute to this repo. And apparently this is still relevant, however this seems stuck. Is there something I'm missing here as to why this can't be merged? It's a very tiny PR, and I'm just having a hard time understanding what the hold up is, and what I can do differently.

Well, we'd really like to make sure that the patch works and that we don't break it in the future. In order to do that it'd be helpful to have a testcase as part of the pull request that checks the output for a site where this PR makes a difference. As I said, I tried to generate one based on the site linked in the initial comment but the outcome of the changes here is that the produced text is worse than before the changes.

It sounds like you're confident that this is a positive change so it would be helpful if you could include a testcase that demonstrates this.