postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Custom extractor: `lead_image_url` selector no longer working when using multi match selection for content #487

Open svenwiegand opened 5 years ago

svenwiegand commented 5 years ago

Description

I have created a custom extractor for https://www.gruene.de/themen/arbeit which delivers the expected results for content and lead_image_url with the following specifications:

export const WwwGrueneDeExtractor = {
  domain: 'www.gruene.de',

  title: {
    selectors: ['header h1'],
  },
  lead_image_url: {
    selectors: [['header img', 'src']],
  },
  content: {
    selectors: ['section'],
  },
}

Let's now switch to a multi-match content selector:

  content: { selectors: [['section header', 'section h2', 'section p', 'section ol']] }

Now lead_image_url is always null, even though the multi-match content selector includes the element that contains the relevant <img> tag. The image is present under the specified path in both the input and the output, and the same selector finds it in Chrome DevTools.

BTW: Though your overall documentation and tools for creating custom extractors are great, it would be helpful if the documentation outlined the order in which transformations are performed and which transformation result each selector is applied to.

mtashley commented 5 years ago

Hi @svenwiegand,

I added some notes to your PR (https://github.com/postlight/mercury-parser/pull/485)

A couple recommendations:

  1. Aim for more general CSS selectors, e.g. h1 rather than header h1. Generally speaking, this helps make the custom extractor less fragile. For content selectors, sticking with section may be the better way to go.
  2. When using multiple selectors, only a single array is needed (you have a nested array above).
  3. While I wasn't able to replicate the behavior, you might also try targeting the image via meta tags og:image and see if that works for you.

svenwiegand commented 5 years ago

Hi @mtashley,

thanks for your response and your review of the PR.

I'm okay with suggestion 1), though to me the explicit selection of header h1 seemed safer. The page structure of "www.gruene.de" isn't well designed overall, so I wanted to make sure I only get the h1 in the header in case there are pages with additional h1s.

I do not agree regarding 2). According to your documentation of content selectors, a nested array behaves completely differently from multiple selectors in a simple array. In this case I explicitly want the behavior you call "multi-match selection" in your documentation. A simple array does not provide the results I want.
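To make the distinction concrete, here is a minimal sketch of the two forms as I understand them from the documentation. The two selector lists are taken from this thread; the helper describeContentEntries is hypothetical, is not part of the parser, and only labels the entries rather than running any extraction.

```javascript
// Plain array: the entries are alternatives. As I read the docs, they are
// tried in order and the first selector that matches is used.
const fallbackContent = {
  selectors: ['section header', 'section h2', 'section p', 'section ol'],
};

// Nested array: a single entry whose selectors are all matched and combined
// into one result ("multi-match selection" in the docs).
const multiMatchContent = {
  selectors: [['section header', 'section h2', 'section p', 'section ol']],
};

// Hypothetical helper (an assumption for illustration, not parser API):
// labels each top-level entry of a content selector list.
function describeContentEntries(field) {
  return field.selectors.map(entry =>
    Array.isArray(entry)
      ? `multi-match of ${entry.length} selectors`
      : 'single selector'
  );
}

console.log(describeContentEntries(fallbackContent));
console.log(describeContentEntries(multiMatchContent));
```

The first list has four top-level entries, the second only one, which is why the two configurations are not interchangeable for my use case.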

Regarding 3): I would be happy to select og:image, but unfortunately that doesn't work either. I don't have a clue why. I tried the following:

lead_image_url: {
  selectors: [['meta[property="og:image"]', 'content']],
},

But it also delivers null, though the selector delivers the expected result in the browser's dev tools.

At first I thought I did not have access to elements inside head, but I can select the page's title element without problems, so something else seems to be wrong here.

And in the end the question remains why the image I originally tried to select is in the input file and in the output file but cannot be selected as lead_image_url as soon as I start to use "multi-match selection" for the content.