black-puppydog closed this issue 5 years ago
Indeed, a selector must return only one match, and there are a couple of ways to handle this:
1. Your idea of querying the image directly from the page is perfectly correct, and the srcset issue that you mentioned has been addressed in #312, which has been merged into master and should be included in the next package release.
2. Alternatively, and in other situations where a unique selector doesn't exist, you can use a selector that accounts for the two matches by having it return the second match, while adding a fallback selector that matches the first element in case the website's HTML is changed to no longer have duplicate tags. So it could be something like:
lead_image_url: {
  selectors: [
    // this basically means: select the `meta[name="og:image"]` that is a subsequent sibling of a `meta[name="og:image"]`
    ['meta[name="og:image"] ~ meta[name="og:image"]', 'value'],
    // if the first selector no longer works, then this meta property no longer has a duplicate and we can safely select the first one
    ['meta[name="og:image"]', 'value'],
  ],
},
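To make the fallback reasoning concrete, here is a minimal sketch in plain JavaScript that models the two selectors on a flat list of sibling tags. It does not use Mercury's or cheerio's API; the names `matchPlain`, `matchSubsequentSibling` and `pickLeadImage`, as well as the example.com URLs, are made up for illustration, and the "exactly one match" rule is assumed from the discussion above.

```javascript
// Toy model of the children of <head>, in document order (hypothetical data).
const duplicatedHead = [
  { tag: 'meta', name: 'og:title', value: 'Some title' },
  { tag: 'meta', name: 'og:image', value: 'https://example.com/a.jpg' },
  { tag: 'meta', name: 'og:image', value: 'https://example.com/b.jpg' },
];

// Models `meta[name="og:image"]`: every og:image meta tag.
function matchPlain(siblings) {
  return siblings.filter(el => el.tag === 'meta' && el.name === 'og:image');
}

// Models `meta[name="og:image"] ~ meta[name="og:image"]`: every og:image
// meta tag that has an earlier og:image sibling, i.e. all but the first.
function matchSubsequentSibling(siblings) {
  return matchPlain(siblings).slice(1);
}

// Try the matchers in order and accept the first one that yields exactly
// one element (the assumed "one match only" rule).
function pickLeadImage(siblings) {
  for (const match of [matchSubsequentSibling, matchPlain]) {
    const found = match(siblings);
    if (found.length === 1) return found[0].value;
  }
  return null;
}

console.log(pickLeadImage(duplicatedHead));
```

With exactly two duplicates the sibling selector matches only the second tag; with a single tag it matches nothing and the plain fallback takes over. Note that in this model three or more duplicates would defeat both selectors, since the sibling selector would then match more than one tag.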
I just tried the second approach and it works, thank you very much, also for taking the time to explain it. :)
Just for clarification: I am already using master, so #312 is already in my local sources. Yet if I query with [['img.wp-post-image', 'src']] I still get the concatenation. Could it be that #312 only changes the URLs after I already extracted the lead_image_url, or should I query differently?
That's awesome, no problem!
When you're using master locally, you need to create your own local build (by following the documented steps), so that the changes that got merged into master (post-v2.0.0 release) are compiled into the distributable files. Could that be the issue for you now?
You mean these steps? https://github.com/postlight/mercury-parser/blob/master/CONTRIBUTING.md#building
Yes, that's what I've been doing. JS really is a different world. I figured if I am executing code that I've written while master is checked out, then I must be running master code overall...? :P
sorry, forgot to close this. thanks again!
I'm having trouble parsing attributes for this page:
https://cosmonaut.blog/2019/02/20/no-bernie/
This might very much be my non-existent JS/CSS skills, so feel free to close, and sorry for the disturbance. The problem I have is with the lead_image_url selectors. The "default" (for most extractors) for this one would be [['meta[property="og:image"]', 'content']] or [['meta[name="twitter:image"]', 'value']], but both of those, when executed, return two near-identical entries, causing the whole thing to fall apart (because if I read the tutorial correctly, they'd need to return exactly one item).
The other idea would be to query the image directly from the page, using [['img.wp-post-image', 'src']], but this is an image with srcset, and so the result ends up being a concatenation of multiple URLs (each of which would be acceptable to me) which I cannot further process in the simple selector: [...] setting.
Am I missing something here?
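For anyone who ends up with such a concatenated value anyway, one possible workaround is to post-process the parser's result outside the extractor. The sketch below assumes the string follows the usual srcset syntax (comma-separated `URL descriptor` pairs); the helper names `parseSrcset` and `largestFromSrcset` and the example.com URLs are hypothetical, and the comma split is a simplification that would break on URLs containing commas.

```javascript
// Split a srcset-style string into { url, descriptor } candidates.
// Simplified: assumes no commas inside the URLs themselves.
function parseSrcset(srcset) {
  return srcset
    .split(',')
    .map(part => part.trim())
    .filter(Boolean)
    .map(part => {
      const [url, descriptor = ''] = part.split(/\s+/);
      return { url, descriptor };
    });
}

// Return the URL with the largest width descriptor (e.g. "1024w").
// Candidates without a width descriptor count as width 0, so the first
// candidate wins when none of them has one.
function largestFromSrcset(srcset) {
  const candidates = parseSrcset(srcset);
  if (candidates.length === 0) return null;
  const width = c =>
    /^\d+w$/.test(c.descriptor) ? parseInt(c.descriptor, 10) : 0;
  return candidates.reduce((best, c) => (width(c) > width(best) ? c : best)).url;
}

const example =
  'https://example.com/a-300.jpg 300w, https://example.com/a-1024.jpg 1024w, https://example.com/a-768.jpg 768w';
console.log(largestFromSrcset(example));
```

Since each of the URLs would be acceptable here, picking the widest candidate is just one reasonable choice; taking the first candidate would work equally well.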
Linux my-desktop 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
2a3ade706dc445ecb09cce552b087c850d2cb817