Open Shepard opened 2 years ago
Thanks for this! I think your proposed solution is right—the idea is that headers at the very beginning of the main content are likely not actually part of the content, but the way it's detected can easily lead to false positives. We'll see about making a change here.
This looks to still be an issue in April 2023. Using it on Hackaday.com/blog will strip everything in a header tag after the page title.
To be clear, all the individual article titles are being stripped off.
Still an issue as of Jan 2024. As @Overwatching pointed out, the subheadings inside the body of the articles are being removed.
Expected Behavior
Heading tags (H1, H2, H3, ...) that are part of the content should be kept, e.g. if they represent subheadings.
Current Behavior
Hx tags are removed if no P tag precedes them at the same DOM level, even when they are subheadings in the middle of the content.
Steps to Reproduce
I noticed this when writing a custom extractor for the site spektrum.de so I'll include the extractor code I have so far.
I then opened a few articles on the website and ran this code in the context of the pages:
Some URLs to try this on:
Detailed Description
This cleaning happens in clean-headers.js. Going by the code comment, it was meant to catch headlines that appear before any text paragraphs on the page. On the Spektrum website, the article is split up into multiple DIVs, each of which can contain Ps, H3s and other content. So you can have an H3 that is a subheading in the middle of the article, but it is the first child inside of a DIV, before any Ps in that DIV. The cleaning code will therefore remove it. I would like to keep those H3s.
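To make the failure mode concrete, here is a minimal sketch of a sibling-level check run against a nested layout like the one described above. It is not the parser's actual code: the `{ tag, text, children }` node model, the `el` helper, and `headersRemovedBySiblingCheck` are all hypothetical names made up for this illustration.

```javascript
// Hypothetical, simplified node model (NOT the parser's real DOM or cheerio API).
const el = (tag, text = '', children = []) => ({ tag, text, children });

const HEADER_TAGS = new Set(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']);

// Sibling-level check: a header counts as "leading" when no <p> comes
// before it among the children of its own parent.
function headersRemovedBySiblingCheck(root) {
  const removed = [];
  const visit = (node) => {
    let sawParagraph = false; // reset for every parent, which is the problem
    for (const child of node.children) {
      if (child.tag === 'p') sawParagraph = true;
      if (HEADER_TAGS.has(child.tag) && !sawParagraph) removed.push(child);
      visit(child);
    }
  };
  visit(root);
  return removed;
}

// Article split into several DIVs; the H3 is a mid-article subheading,
// but it is the first child of its own DIV.
const article = el('div', '', [
  el('div', '', [el('p', 'Intro paragraph.'), el('p', 'More text.')]),
  el('div', '', [el('h3', 'Subheading'), el('p', 'Section text.')]),
]);

console.log(headersRemovedBySiblingCheck(article).map((n) => n.text));
```

In this sketch the subheading is flagged for removal even though two paragraphs precede it in the article, because those paragraphs live in a different DIV.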
Possible Solution
Honestly, this piece of code seems to be a bit too broad and aggressive in cleaning headers. Personally I would remove this part of the function entirely. But if you see merit in keeping it, perhaps making sure it only finds headers before Ps across the whole content (instead of checking on the DOM level of the header) would be closer to the original intention of the code.
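A rough sketch of the second option, checking across the whole content in document order instead of per DOM level, might look like the following. It uses a hypothetical `{ tag, text, children }` node model rather than the parser's real DOM, and `headersBeforeFirstParagraph` is an illustrative name, not an existing function in the project.

```javascript
// Hypothetical, simplified node model (NOT the parser's real DOM or cheerio API).
const el = (tag, text = '', children = []) => ({ tag, text, children });

const HEADER_TAGS = new Set(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']);

// Document-order check: walk the whole content depth-first and collect only
// the headers that occur before the first <p> anywhere in the content.
function headersBeforeFirstParagraph(root) {
  const leading = [];
  let sawParagraph = false; // shared across the entire walk
  const visit = (node) => {
    if (sawParagraph) return; // everything after the first <p> is kept
    if (node.tag === 'p') {
      sawParagraph = true;
      return;
    }
    if (HEADER_TAGS.has(node.tag)) leading.push(node);
    node.children.forEach((child) => visit(child));
  };
  visit(root);
  return leading;
}

// A stray header sits before any paragraph; the H3 is a real subheading
// that happens to be the first child of a later DIV.
const content = el('div', '', [
  el('h2', 'Stray title'),
  el('div', '', [el('p', 'Intro paragraph.')]),
  el('div', '', [el('h3', 'Subheading'), el('p', 'Section text.')]),
]);

console.log(headersBeforeFirstParagraph(content).map((n) => n.text));
```

With this approach only the header ahead of the first paragraph is a removal candidate, and the nested H3 subheading survives regardless of how the article is divided into DIVs.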