Open uqs opened 3 months ago
I was also surprised by how Readability handles headings.
The demo below converts a page to markdown. It first uses Readability to strip out the unnecessary content.
However, the heading levels (h1, h2, h3) are lost in the process: every heading ends up as an h2.
Is there a way to turn this off?
Here is a live demo
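For anyone who wants to check the flattening on their own pages, here is a rough sketch that lists the heading levels found in an HTML string, so the input and the Readability output can be compared. The `headingLevels` helper and both sample strings are made up for illustration; the "after" string only mimics the behavior described above and is not real Readability output.

```javascript
// Hypothetical helper: collect the heading levels (1-6) in document order.
// Uses a regex for brevity; a real check would walk a parsed DOM instead.
function headingLevels(html) {
  return [...html.matchAll(/<h([1-6])\b[^>]*>/gi)].map((m) => Number(m[1]));
}

// Made-up sample input with three distinct heading levels.
const before = "<h1>Title</h1><p>Intro</p><h2>Part</h2><h3>Detail</h3>";
// Made-up string imitating the reported result: everything is an h2.
const after = "<h2>Title</h2><p>Intro</p><h2>Part</h2><h2>Detail</h2>";

console.log(headingLevels(before)); // [ 1, 2, 3 ]
console.log(headingLevels(after)); // [ 2, 2, 2 ]
```

If the two lists differ, the heading hierarchy changed somewhere between the source page and the parsed article.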
The Substack issue may be due to including `header` in the unlikely candidates list. We should try removing it and see if it fixes this and other Substack parsing issues.
@yagudaev: Thank you for your contribution, but I think this is unrelated to the issue @uqs filed. Would you mind opening a new issue, please?
@cmkm got it, I'll open a new issue 😊.
Created a new issue here: https://github.com/mozilla/readability/issues/863 -- cleaned up the demo and made the description more clear
had a quick look at this. removing `header` from unlikely does prevent the `h1` from being culled, but it gets removed later on due to low class weight, probably because the class name is `header-with-anchor-widget` and `widget` is listed in the negative regex.
removing `header` from unlikely and adding `header` to positive does ensure all four headers appear in the output properly. this also improves the parsing on quite a few of the other test cases, e.g. re-adding the rubric as well as the headings for "Pros", "Cons", and "Summary" on the Engadget review, and the definition terms in the Google SRE test case. it does introduce a couple of issues on other pages though, like adding back the site index section on NYT pages and a pair of duplicated headings on the Mercurial test case.
i'll push up a quick draft PR for review
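To make the class-weight reasoning above concrete, here is a minimal sketch of how a negative and a positive pattern could cancel out for the class name `header-with-anchor-widget`. The three regexes are short, made-up stand-ins rather than Readability's real lists, and the `classWeight` function and its 25-point step are assumptions used only to illustrate the idea that a heading with a negative weight gets dropped.

```javascript
// Made-up, shortened stand-ins for the patterns discussed in this thread.
const NEGATIVE = /footer|sidebar|widget/i;
const POSITIVE_CURRENT = /article|content|main/i;
// Same list with "header" added, as proposed above.
const POSITIVE_PROPOSED = /article|content|header|main/i;

// Hypothetical scoring: subtract for a negative match, add for a positive one.
function classWeight(className, positive, negative) {
  let weight = 0;
  if (negative.test(className)) weight -= 25;
  if (positive.test(className)) weight += 25;
  return weight;
}

const cls = "header-with-anchor-widget";
const current = classWeight(cls, POSITIVE_CURRENT, NEGATIVE);
const proposed = classWeight(cls, POSITIVE_PROPOSED, NEGATIVE);

console.log(current); // -25: only "widget" matches, so the heading would be dropped
console.log(proposed); // 0: "header" offsets "widget", so the heading would survive
```

The same offsetting effect applies to any other element whose class contains `header`, which would fit the regressions mentioned above such as the NYT site index coming back.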
Hi, pretty much every article on that substack is missing the headers when turning on reader mode.
`<h1>` I. Foo, followed by some `<p>` and then another `<h1>` II. Bar. The headers are not turned into headings in reader mode.