postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Header cleaning is a bit too aggressive #675

Open Shepard opened 2 years ago

Shepard commented 2 years ago

Expected Behavior

Heading tags (H1, H2, H3, ...) that are part of the content should be kept, e.g. if they represent subheadings.

Current Behavior

Hx tags are removed if they appear before any P tags.

Steps to Reproduce

I noticed this when writing a custom extractor for the site spektrum.de so I'll include the extractor code I have so far.

import Mercury from '@postlight/mercury-parser';

const spektrumExtractor = {
    domain: 'www.spektrum.de',
    title: {
        selectors: ['.content__title'],
    },
    author: {
        selectors: ['.content__author__info__name'],
    },
    content: {
        selectors: ['article.content'],
        clean: [
            '.breadcrumbs',
            '.hide-for-print',
            'aside',
            'header',
            '.image__article__top',
            '.content__author',
            '.copyright',
            '.callout-box',
        ],
    },
    date_published: {
        selectors: ['.content__meta__date'],
    },
    lead_image_url: {
        selectors: [['meta[property="og:image"]', 'content'], '.image__article__top img'],
    },
    dek: {
        selectors: ['.content__intro'],
    },
};

Mercury.addExtractor(spektrumExtractor);

I then opened a few articles on the website and ran this code in the context of the pages:

const result = await Mercury.parse(document.URL, {
    html: document.documentElement.outerHTML,
    fetchAllPages: false,
});
console.log(result.content);

Some URLs to try this on:

Detailed Description

This cleaning happens in clean-headers.js. Going by the code comment, it was meant to catch headlines that appear before any text paragraphs on the page. On the Spektrum website, the article is split up into multiple DIVs, each of which can contain Ps, H3s and other content. So you can have an H3 that is a subheading in the middle of the article, but because it is the first child of its DIV, it comes before any Ps in that DIV. The cleaning code will therefore remove it. I would like to keep those H3s.
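To make the failure mode concrete, here is a minimal sketch of a sibling-level check. It is not the code from clean-headers.js: it uses a hypothetical plain-object tree (`el`, `removeHeadersBeforeSiblingP` and `tagsInOrder` are names made up for this illustration) and only assumes that a header is dropped when no earlier sibling in the same parent is a P.

```javascript
// Hypothetical mini-DOM for illustration: each node is { tag, children }.
const el = (tag, ...children) => ({ tag, children });

const HEADER_TAGS = new Set(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']);

// Sibling-level check (assumed behaviour): a header is dropped when no
// earlier sibling inside the same parent is a <p>.
function removeHeadersBeforeSiblingP(node) {
  let seenP = false; // reset for every parent, which is the problem
  const kept = [];
  for (const child of node.children) {
    if (child.tag === 'p') seenP = true;
    if (HEADER_TAGS.has(child.tag) && !seenP) continue; // header removed
    kept.push(removeHeadersBeforeSiblingP(child));
  }
  return { ...node, children: kept };
}

// Collect tag names in document order so the result is easy to inspect.
function tagsInOrder(node, out = []) {
  out.push(node.tag);
  node.children.forEach((child) => tagsInOrder(child, out));
  return out;
}

// An article split into DIVs; the H3 is a mid-article subheading that
// happens to be the first child of its DIV.
const article = el(
  'article',
  el('div', el('p'), el('p')),
  el('div', el('h3'), el('p')),
);

const siblingResult = tagsInOrder(removeHeadersBeforeSiblingP(article));
console.log(siblingResult.join(' '));
```

In this model the H3 disappears even though two paragraphs precede it in the article, because the check never looks outside the H3's own DIV.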

Possible Solution

Honestly, this piece of code seems to be a bit too broad and aggressive in cleaning headers. Personally I would remove this part of the function entirely. But if you see merit in keeping it, perhaps making sure it only finds headers before Ps across the whole content (instead of checking on the DOM level of the header) would be closer to the original intention of the code.
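A rough sketch of what the second option could look like, again on a hypothetical plain-object tree rather than the parser's real DOM (`el`, `removeLeadingHeaders` and `tagsInOrder` are illustrative names, not part of Mercury): headers are dropped only until the first P anywhere in the content has been seen in document order.

```javascript
// Hypothetical mini-DOM for illustration: each node is { tag, children }.
const el = (tag, ...children) => ({ tag, children });

const HEADER_TAGS = new Set(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']);

// Document-order check: remove headers only while no <p> has been seen
// anywhere in the content so far, regardless of nesting depth.
function removeLeadingHeaders(root) {
  let seenP = false; // shared across the whole walk, not per parent
  const walk = (node) => {
    const kept = [];
    for (const child of node.children) {
      if (child.tag === 'p') seenP = true;
      if (HEADER_TAGS.has(child.tag) && !seenP) continue; // leading header
      kept.push(walk(child)); // depth-first keeps document order
    }
    return { ...node, children: kept };
  };
  return walk(root);
}

// Collect tag names in document order so the result is easy to inspect.
function tagsInOrder(node, out = []) {
  out.push(node.tag);
  node.children.forEach((child) => tagsInOrder(child, out));
  return out;
}

// The H1 precedes all paragraphs; the H3 is a subheading that is the
// first child of a later DIV.
const article = el(
  'article',
  el('h1'),
  el('div', el('p'), el('p')),
  el('div', el('h3'), el('p')),
);

const orderResult = tagsInOrder(removeLeadingHeaders(article));
console.log(orderResult.join(' '));
```

Here the leading H1 is still removed, while the H3 survives because paragraphs already appeared earlier in the article. Whether this matches the original intention of the cleaning step is for the maintainers to judge.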

johnholdun commented 2 years ago

Thanks for this! I think your proposed solution is right—the idea is that headers at the very beginning of the main content are likely not actually part of the content, but the way it's detected can easily lead to false positives. We'll see about making a change here.

Overwatching commented 1 year ago

This looks to still be an issue in April 2023. Using it on Hackaday.com/blog will strip everything in a header tag after the page title.

To be clear, all the individual article titles are being stripped off.

Piny2u commented 6 months ago

Still an issue as of Jan 2024. As @Overwatching pointed out, the subheadings inside the body of the articles are being removed.