postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Is it intentional that dek elements need to be contained in the content? #676

Open Shepard opened 2 years ago

Shepard commented 2 years ago

Expected Behavior

When defining a custom extractor, elements selected via the selector for the "dek" can be found anywhere in the document.

Current Behavior

The selector only finds something if the dek element is included in whatever the content selectors returned after selecting and cleaning.

Steps to Reproduce

I noticed this when writing a custom extractor for the site spektrum.de so I'll include the extractor code I have so far.

import Mercury from '@postlight/mercury-parser';

const SpektrumExtractor = {
  domain: 'www.spektrum.de',

  title: {
    selectors: [
      '.content__title'
    ],
  },

  author: {
    selectors: [
      '.content__author__info__name'
    ],
  },

  date_published: {
    selectors: [
      '.content__meta__date'
    ],
  },

  dek: {
    selectors: [
      '.content__intro'
    ],
  },

  lead_image_url: {
    selectors: [
      ['meta[name="og:image"]', 'value'],
      ['meta[property="og:image"]', 'content'],
      '.image__article__top img',
    ],
  },

  content: {
    selectors: [
      'article.content'
    ],
    clean: [
      '.breadcrumbs',
      '.hide-for-print',
      'aside',
      'header',
      '.image__article__top',
      '.content__author',
      '.copyright',
      '.callout-box',
    ],
  },
};

// Register the extractor; the identifier must match the declaration above.
Mercury.addExtractor(SpektrumExtractor);

I then opened the article https://www.spektrum.de/news/genetik-das-geheimnis-der-parasitischen-rafflesien/2039026 and ran this code in the context of the page:

const result = await Mercury.parse(document.URL, {
    html: document.documentElement.outerHTML,
    fetchAllPages: false,
});
console.log(result.dek);

The console output is null. If I change the 'header' entry in the content's clean list to 'header h2', the dek element is no longer removed from the content, so the dek selector finds it and its text appears on the console.
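
For reference, the adjusted content block from that experiment might look like the sketch below. Only the 'header' entry changes. The variable name adjustedContent is made up for illustration, and the comment about why it helps is an inference from the observed behavior, not something taken from Mercury's documentation.

```javascript
// Sketch of the adjusted `content` block (hypothetical variable name).
// Assumption: removing only `header h2` leaves the `.content__intro`
// element inside the extracted content, so the dek selector can match it.
const adjustedContent = {
  selectors: ['article.content'],
  clean: [
    '.breadcrumbs',
    '.hide-for-print',
    'aside',
    'header h2', // was 'header', which apparently also removed the dek element
    '.image__article__top',
    '.content__author',
    '.copyright',
    '.callout-box',
  ],
};
```

The drawback is that the dek text then also ends up in result.content.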

Detailed Description

I'm writing a custom extractor and noticed that the dek property was always null after parsing. All the other properties worked, and an element matching the dek selector was clearly present in the document. While debugging, I found that the selector cannot match because, by the time the extraction code gets to the dek, the DOM it is applied to is no longer the original document but (from the looks of it) only what is left after extracting and cleaning the content property.
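
A toy model of what seems to be happening, written without any Mercury code: the "page" is a flat list of made-up element records, and the functions extractContent and selectText are hypothetical helpers, not part of the library. It also assumes the dek element sits inside a header within article.content, which matches what the experiment above suggests but is not verified against the real markup.

```javascript
// Toy model only; this is NOT Mercury's implementation.
// Each record names the selector matching the element itself and the
// selectors matching its ancestors (assumed page structure).
const page = [
  { self: '.content__intro', ancestors: ['article.content', 'header'], text: 'Intro (dek)' },
  { self: 'p', ancestors: ['article.content'], text: 'Body' },
];

// Keep elements under `selector`, then drop any element that matches,
// or sits under, one of the `clean` selectors.
function extractContent(elements, selector, clean) {
  return elements
    .filter((el) => el.ancestors.includes(selector))
    .filter((el) => !clean.some((c) => c === el.self || el.ancestors.includes(c)));
}

// Return the text of the first element matching `selector`, or null.
function selectText(elements, selector) {
  const hit = elements.find((el) => el.self === selector);
  return hit ? hit.text : null;
}

const content = extractContent(page, 'article.content', ['header']);

// Selecting the dek from the cleaned content finds nothing ...
const dekFromContent = selectText(content, '.content__intro');
// ... while selecting it from the original page would succeed.
const dekFromPage = selectText(page, '.content__intro');

console.log(dekFromContent, dekFromPage);
```

In this model the first lookup yields null and the second yields the intro text, which mirrors the difference between the current and the expected behavior.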

So, effectively, the dek has to be contained in the content in order to be found. I'm wondering if this is intentional. If so, I can adjust my selectors for the content to include the dek but I'd rather not have that bit in there. To me, the content should only be the main body of text.