mozilla / readability

A standalone version of the readability lib
Other
8.96k stars 607 forks source link

Readability mode drops hyperlinked paragraph #878

Open swethapillai opened 4 months ago

swethapillai commented 4 months ago

Hi all, I am experiencing an issue with applying readability to the following webpage: https://www.hartlepoolmail.co.uk/news/crime/seven-hartlepool-men-appear-in-court-charged-with-the-murder-of-michael-phillips-637809

This webpage has two paragraphs that are hyperlinked:

<div class="Markup__ParagraphWrapper-sc-13q6ywe-0 iWyzoK markup ">
<p><a href="https://www.hartlepoolmail.co.uk/news/crime/live-updates-seven-hartlepool-men-appear-court-charged-murder-michael-phillips-637633" rel="nofollow" target="_blank" data-vars-event="gaEvent" data-vars-ec="navigation" data-vars-ea="in article" data-vars-el="plain links" data-vars-aidclick="637633" data-vars-titleclick="On Friday September 27, John Musgrave, 54, Sean Musgrave, 30, both Wordsworth Avenue, Hartlepool, and Anthony Small, 39, of Rydal Street, Hartlepool all denied the charge of murder. " data-vars-urlclick="https://www.hartlepoolmail.co.uk/news/crime/live-updates-seven-hartlepool-men-appear-court-charged-murder-michael-phillips-637633"><strong>On Friday September 27, John Musgrave, 54, Sean Musgrave, 30, both Wordsworth Avenue, Hartlepool, and Anthony Small, 39, of Rydal Street, Hartlepool all denied the charge of murder. </strong></a></p>
</div>
<div class="Markup__ParagraphWrapper-sc-13q6ywe-0 iWyzoK markup ">
<p><strong><a href="https://www.hartlepoolmail.co.uk/news/crime/niramax-boss-neil-elliott-and-co-accused-face-trial-next-year-after-denying-murder-of-michael-phillips-472180" rel="nofollow" target="_blank" data-vars-event="gaEvent" data-vars-ec="navigation" data-vars-ea="in article" data-vars-el="plain links" data-vars-aidclick="472180" data-vars-titleclick="Previously, a director of Niramax waste management firm, Neil Elliott, 44, of Briarfield Close, Hartlepool, and co-accused Lee Darby, 31, of Ridley Court, Hartlepool denied killing Mr Phillips. " data-vars-urlclick="https://www.hartlepoolmail.co.uk/news/crime/niramax-boss-neil-elliott-and-co-accused-face-trial-next-year-after-denying-murder-of-michael-phillips-472180">Previously, a director of Niramax waste management firm, Neil Elliott, 44, of Briarfield Close, Hartlepool, and co-accused Lee Darby, 31, of Ridley Court, Hartlepool denied killing Mr Phillips. </a></strong></p>
</div>

However, readability drops the second hyperlinked paragraph, and only keeps the first hyperlinked paragraph amongst the other extracted text.

image image

I believe it could be an issue with readability and the density of commas in these wrapped sections of text. I found that when I remove a single comma from the first hyperlinked paragraph and then apply readability mode - both hyperlinked paragraphs show up.

The way in which this was found was:

  1. Open webpage in firefox
  2. Click on readability icon

Is this an expected limitation of the package?

gijsk commented 4 months ago

Is this an expected limitation of the package?

Yes and no. :-)

No, as in, we should capture all the article text.

Yes in the sense that, the code tries to determine whether paragraphs and links are relevant based on the number of commas / text / links in them. Though the outcome is surprising to me here - running the processing with debug turned on might reveal why this is happening.

swethapillai commented 4 months ago

Ran processing with debug turned on and it looks like we drop the missing text string in the Article content post-prep stage of readability. I.e. we successfully grab the missing string: 'Previously, a director of...' for the first stage: Pre-GrabArticle and the second: Article content pre-prep but drop this text in Article content post-prep.

Debug output: HartlepoolNiramaxReadabilityDebugLogOutput.txt

swethapillai commented 4 months ago

It appears that the text is being dropped in the following location:

PrepArticle() > this.CleanConditionally(articleContent, "div"); (line 582) where flag1 is false and flag2, which is the output, is true. The value of flag2 is determined by the following condition:

bool flag2 = 
    (double) num3 > 1.0 && (double) num2 / (double) num3 < 0.5 && !Reader.HasAncestorTag(node, "figure") ||
    !flag1 && (double) num4 > (double) num2 ||
    (double) num5 > Math.Floor((double) num2 / 3.0) ||
    !flag1 && (double) textDensity < 0.8999999761581421 && ((length1 < 25 ? 1 : 0) & (num3.CompareTo(0.0f) == 0 ? 1 : ((double) num3 > 2.0 ? 1 : 0))) != 0 && !Reader.HasAncestorTag(node, "figure") ||
    !flag1 && classWeight < 25 && linkDensity > 0.20000000298023224 ||
    (double) classWeight >= 25.0 && linkDensity > 0.5 ||
    num6 == 1 && length1 < 75 ||
    num6 > 1;

where, in this case, the following condition is true:

!flag1 &&
(double) textDensity < 0.8999999761581421 &&
((length1 < 25 ? 1 : 0) & (num3.CompareTo(0.0f) == 0 ? 1 : ((double) num3 > 2.0 ? 1 : 0))) != 0 &&
!Reader.HasAncestorTag(node, "figure")
gijsk commented 4 months ago

Can you try with current main HEAD? That would provide better logging output.

What you've shared so far is confusing, as the variable names and method calls all appear to have been changed; are you using some other version of the package (not the plain JS in this github repo) ? Anyway - length1 here should be the innerText.length of the node in question, which is way longer than 25, so I don't understand how that could be true here. On main the logging will indicate exactly which condition is failing and on what grounds.

swethapillai commented 4 months ago

Ahh sorry! We are using a package that uses Readability within it *SmartReader. Hence why the output is not as you expected. However the issue does still show up when viewing the webpage using readability via the Firefox browser. The debug logs, using readability from this repo, also show that we drop the text string in: Article content post-prep Debug logs: ReadabilityDebugOutput.txt Here's the repo where we tested this out - if you want to have a deeper look.

george-kirkman commented 4 months ago

@gijsk Did you get a change to look at this since the correct output logs?

gijsk commented 4 months ago

I'm sorry, I haven't - I will put it on my list for next week. The debug output unfortunately still didn't have what I was hoping for - the logging here I think should be hit for the code described in this comment.

gijsk commented 4 months ago

So a log from the commandline when generating a testcase and then running the debug-testcase.js script against the same source produces:

Reader: (Readability) Cleaning Conditionally <div class="Markup__ParagraphWrapper-sc-13q6ywe-0 bhZZYl markup ">
Reader: (Readability) Checks failed [ 'Low weight and a little linky. (linkDensity=1)' ]

This makes sense - all the text in that paragraph is contained in a link (so the link density is 1).

I found that when I remove a single comma from the first hyperlinked paragraph and then apply readability mode - both hyperlinked paragraphs show up.

This would make less sense to me - but also I cannot reproduce it and you didn't specify which comma you removed... instead, I see removing a comma from that first para causing that para to also be removed...

I think the correct fix is probably to specialcase situations where a single <a> tag is the entire content of a <p> or even <h1>/<h2>/... tag and strip out the link but keep its text, or something - assuming the text is long enough (for some meaning of long enough).