mozilla / readability

A standalone version of the readability lib

H1 Headers ignored/skipped on https://www.astralcodexten.com/p/practically-a-book-review-rootclaim #855

Open uqs opened 3 months ago

uqs commented 3 months ago

Hi, pretty much every article on that Substack is missing its headings when reader mode is turned on.

  1. Go to https://www.astralcodexten.com/p/practically-a-book-review-rootclaim and turn on reader mode
  2. The article starts with an <h1> I. Foo, followed by some <p> and then another <h1> II. Bar. The headers are not turned into headings in reader mode.
  3. The title on the page has a tagline and the date. Both are missing in reader mode, although it does show the author name.

yagudaev commented 2 months ago

I was also surprised by how Readability handles headings.

The demo below converts a page to markdown. It first uses Readability to strip out the unnecessary content.

However, the heading levels (h1, h2, h3) are lost: every heading ends up as an h2.

Is there a way to turn this off?

Here is a live demo

cmkm commented 2 months ago

The Substack issue may be caused by "header" being in the unlikely candidates list. We should try removing it and see whether that fixes this and other Substack parsing issues.

@yagudaev: Thank you for your contribution, but I think this is unrelated to the issue @uqs filed. Would you mind opening a new issue, please?

yagudaev commented 2 months ago

@cmkm got it, I'll open a new issue 😊.

Created a new issue here: https://github.com/mozilla/readability/issues/863 -- cleaned up the demo and made the description clearer

inhumantsar commented 1 month ago

had a quick look at this. removing "header" from unlikely does prevent the h1 from being culled, but it still gets removed later on due to low class weight, probably because the class name is header-with-anchor-widget and "widget" is listed in the negative regex.
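
As a rough illustration of the first step (a sketch, not Readability's actual code), the early cull can be modelled as a regex test on the element's class name and id. The regexes below are shortened, illustrative subsets of the real lists, and `isUnlikelyCandidate` is a hypothetical helper name introduced only for this example.

```typescript
// Shortened, illustrative subsets of the candidate regexes (assumption:
// the real lists contain many more entries than shown here).
const UNLIKELY_WITH_HEADER = /banner|comment|footer|header|sidebar|sponsor/i;
const UNLIKELY_WITHOUT_HEADER = /banner|comment|footer|sidebar|sponsor/i;
const OK_MAYBE_CANDIDATE = /and|article|body|column|content|main|shadow/i;

// Hypothetical helper: true when a node would be dropped early,
// judging only by its class name and id.
function isUnlikelyCandidate(
  className: string,
  id: string,
  unlikely: RegExp
): boolean {
  const matchString = `${className} ${id}`;
  return unlikely.test(matchString) && !OK_MAYBE_CANDIDATE.test(matchString);
}

// Class name reported for the Substack <h1> elements in this thread.
const substackHeadingClass = "header-with-anchor-widget";

const droppedToday = isUnlikelyCandidate(
  substackHeadingClass,
  "",
  UNLIKELY_WITH_HEADER
);
const droppedAfterChange = isUnlikelyCandidate(
  substackHeadingClass,
  "",
  UNLIKELY_WITHOUT_HEADER
);

console.log({ droppedToday, droppedAfterChange });
```

With "header" in the list the class name matches and the heading is culled; without it the heading passes this first check, which matches the behaviour described above.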

removing "header" from unlikely and adding it to positive does ensure all four headers appear in the output properly. this also improves the parsing on quite a few of the other test cases, eg: re-adding the rubric as well as the headings for "Pros", "Cons", and "Summary" on the Engadget review, and the definition terms in the Google SRE test case. it does introduce a couple of issues on other pages though, like adding back the site index section on NYT pages and a pair of duplicated headings on the Mercurial test case.
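
The class-weight effect can be sketched the same way. This is a simplified model, assuming a heading is dropped when its weight is below zero and that a negative or positive match each shifts the weight by 25. The regexes are again shortened subsets, and `classWeight` and `headingSurvives` are hypothetical names for this example only.

```typescript
// Shortened, illustrative subsets of the weighting regexes (assumption).
const NEGATIVE = /comment|footer|sidebar|sponsor|widget/i;
const POSITIVE = /article|body|content|main|post|story/i;
// Proposed variant from this thread: "header" added to the positive list.
const POSITIVE_WITH_HEADER = /article|body|content|header|main|post|story/i;

// Hypothetical helper: -25 for a negative match, +25 for a positive match.
function classWeight(className: string, positive: RegExp): number {
  let weight = 0;
  if (NEGATIVE.test(className)) weight -= 25;
  if (positive.test(className)) weight += 25;
  return weight;
}

// Hypothetical helper: a heading is kept unless its weight is negative.
function headingSurvives(className: string, positive: RegExp): boolean {
  return classWeight(className, positive) >= 0;
}

const headingClass = "header-with-anchor-widget";

const weightBefore = classWeight(headingClass, POSITIVE);
const weightAfter = classWeight(headingClass, POSITIVE_WITH_HEADER);

console.log({
  weightBefore,
  keptBefore: headingSurvives(headingClass, POSITIVE),
  weightAfter,
  keptAfter: headingSurvives(headingClass, POSITIVE_WITH_HEADER),
});
```

In this model "widget" alone pulls the weight below zero, while a "header" match in the positive list cancels it out, so the heading is kept. The same change would also rescue any other element whose class contains "header", which is one plausible reason for the regressions on the NYT and Mercurial test cases.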

i'll push up a quick draft PR for review