Open svenwiegand opened 5 years ago
Hi @svenwiegand,
I added some notes to your PR (https://github.com/postlight/mercury-parser/pull/485)
A couple recommendations:
1) `h1` vs `header h1`. Generally speaking, this will help make the custom extractor less fragile.
2) For content selectors, sticking with `section` may be the better way to go.
3) `og:image` and see if that works for you.

Hi @mtashley,
thanks for your response and your review of the PR.
I'm okay with suggestion 1), though to me the explicit selection of `header h1` seemed safer. The page structure of www.gruene.de isn't well designed overall, so I wanted to make sure I only get the `h1` in the header in case there are pages with additional `h1`s.
I do not agree regarding 2). According to your documentation of content selectors, a nested array behaves completely differently from multiple selectors in a simple array. In this case I explicitly want the behavior you call "multi-match selection" in your documentation. A simple array does not provide the results I want.
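To make the difference concrete, here is a minimal sketch of the two shapes. The selectors `.intro` and `.body` are hypothetical placeholders, not the ones from the PR, and `isMultiMatch` is a helper invented for illustration. The comments paraphrase the behavior as the custom extractor documentation describes it, so treat them as an assumption rather than a statement about the implementation.

```javascript
// Simple array: alternative selectors, tried in order until one matches.
const fallbackContent = {
  selectors: ['.intro', '.body'], // hypothetical selectors
};

// Nested array: a single multi-match entry; the matched elements are combined.
const multiMatchContent = {
  selectors: [['.intro', '.body']], // hypothetical selectors
};

// Hypothetical helper: an entry that is itself an array is a multi-match entry
// (this applies to content selectors only).
const isMultiMatch = entry => Array.isArray(entry);

console.log(fallbackContent.selectors.map(isMultiMatch)); // [ false, false ]
console.log(multiMatchContent.selectors.map(isMultiMatch)); // [ true ]
```

The simple array holds two independent entries, while the nested array holds one entry made of two selectors, which is why the two configurations are not interchangeable.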
Regarding 3): I would be happy to select `og:image`, but unfortunately that doesn't work either. I don't have a clue why. I tried the following:
lead_image_url: {
  selectors: [['meta[property="og:image"]', 'content']],
},
But it also delivers `null`, though the selector delivers the expected result in the browser's dev tools.
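For reference, the dev tools check mentioned above can be reproduced with a small helper like the one below. This is only a sketch: `readAttribute` and `fakeDoc` are names made up for illustration, and the image URL is a placeholder rather than the real value on the page. It uses the standard DOM methods `querySelector` and `getAttribute` and says nothing about how the parser evaluates the selector internally.

```javascript
// Read one attribute from the first element matching a CSS selector.
// `doc` is any object with a DOM-style querySelector, e.g. `document`.
function readAttribute(doc, selector, attribute) {
  const element = doc.querySelector(selector);
  return element ? element.getAttribute(attribute) : null;
}

// Stand-in document so the sketch also runs outside a browser.
// The URL is a made-up placeholder, not the real image on the page.
const fakeDoc = {
  querySelector: selector =>
    selector === 'meta[property="og:image"]'
      ? {
          getAttribute: name =>
            name === 'content' ? 'https://example.com/lead.jpg' : null,
        }
      : null,
};

console.log(readAttribute(fakeDoc, 'meta[property="og:image"]', 'content'));
```

In the browser console, pass `document` instead of `fakeDoc` to run the same check against the live page.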
At first I thought I did not have access to elements inside `head`, but I can select the page's `title` element without problems, so something else seems to be wrong here.
In the end, the question remains why the image I originally tried to select is in the input file and in the output file, but cannot be selected as `lead_image_url` as soon as I start to use "multi-match selection" for the content.
Description
I have created a custom extractor for https://www.gruene.de/themen/arbeit which delivers the expected results for `content` and `lead_image_url` with the following specifications:

Let's now switch to a multi-match content selector:

Now the `lead_image_url` is always `null`, though the multi-match content selector contains the element with the relevant `<image>` tag, and the image is present under the specified path in both the input and the output and can be selected with the same selector in Chrome dev tools.

BTW: Though your overall documentation and tools for creating custom extractors are great, it would be helpful if the documentation could outline the order in which transformations are performed and on which transformation result each selector is applied.