postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Duplicate meta entries --> fail #330

black-puppydog closed 5 years ago

black-puppydog commented 5 years ago

I'm having trouble parsing attributes for this page:

https://cosmonaut.blog/2019/02/20/no-bernie/

This may well be down to my non-existent JS/CSS skills, so feel free to close, and sorry for the disturbance. The problem I have is with the lead_image_url selectors. The "default" (for most extractors) for this one would be [['meta[property="og:image"]', 'content']] or [['meta[name="twitter:image"]','value']], but on this page each of those returns two near-identical entries, so the selector fails (if I read the tutorial correctly, a selector needs to return exactly one item).

The other idea would be to query the image directly from the page, using [['img.wp-post-image', 'src']], but that image has a srcset, so the result ends up being a concatenation of multiple URLs. Any one of them would be acceptable to me, but I cannot process the result further within the simple selectors: [...] setting.

Am I missing something here?
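
For readers hitting the same srcset concatenation: a minimal sketch of how such a value could be split up outside the extractor config, assuming it follows the standard srcset syntax (comma-separated candidates, each a URL optionally followed by a width or density descriptor). The helper names `parseSrcset` and `firstSrcsetUrl` and the example.com URLs are hypothetical and not part of the parser.

```javascript
// Hypothetical helper: split a srcset string into { url, descriptor } candidates.
// Assumes no URL contains a comma, which the full srcset grammar does allow.
function parseSrcset(srcset) {
  return srcset
    .split(',')
    .map(candidate => candidate.trim())
    .filter(candidate => candidate.length > 0)
    .map(candidate => {
      const [url, descriptor = ''] = candidate.split(/\s+/);
      return { url, descriptor };
    });
}

// Hypothetical helper: return the URL of the first candidate, or null if there is none.
function firstSrcsetUrl(srcset) {
  const candidates = parseSrcset(srcset);
  return candidates.length > 0 ? candidates[0].url : null;
}

// Placeholder value, not taken from the page in question.
const example =
  'https://example.com/img-300.jpg 300w, https://example.com/img-768.jpg 768w';
console.log(firstSrcsetUrl(example));
```

This only helps when the extracted string can be post-processed in your own code; it does not change what a plain [selector, attribute] pair returns.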

toufic-m commented 5 years ago

Indeed, a selector must return only one match, and there are a couple of ways to handle this:

  lead_image_url: {
    selectors: [
      // Select the `meta[name="og:image"]` that is a subsequent sibling of
      // another `meta[name="og:image"]`, i.e. the duplicate.
      ['meta[name="og:image"] ~ meta[name="og:image"]', 'value'],
      // Fallback: if the first selector no longer matches, this meta property
      // no longer has a duplicate and we can safely select the first one.
      ['meta[name="og:image"]', 'value'],
    ],
  },
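
The fallback order in the snippet above relies on the "exactly one match" rule discussed in this thread. A rough sketch of that rule follows, as an illustration only and not the parser's actual implementation; `pickSingleMatch`, the `query` callback, and the `fakePage` data with its placeholder URLs are all hypothetical.

```javascript
// Hypothetical helper, not the parser's real code: return the value from the
// first [selector, attribute] pair that matches exactly one element.
// `query(selector, attribute)` is an assumed callback that returns an array
// of attribute values, one per matched element.
function pickSingleMatch(selectors, query) {
  for (const [selector, attribute] of selectors) {
    const values = query(selector, attribute);
    if (values.length === 1) return values[0];
  }
  return null;
}

// Stand-in for a page with two duplicate og:image tags (placeholder URLs).
const fakePage = {
  'meta[name="og:image"]': [
    'https://example.com/a.jpg',
    'https://example.com/a.jpg',
  ],
  'meta[name="og:image"] ~ meta[name="og:image"]': ['https://example.com/a.jpg'],
};
const query = selector => fakePage[selector] || [];

const selectors = [
  ['meta[name="og:image"] ~ meta[name="og:image"]', 'value'],
  ['meta[name="og:image"]', 'value'],
];
console.log(pickSingleMatch(selectors, query));
```

With two duplicate tags, the plain selector yields two values and is skipped, while the sibling selector yields one and is used. If the page later drops the duplicate, the sibling selector matches nothing and the plain selector takes over.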

black-puppydog commented 5 years ago

I just tried the second approach and it works, thank you very much, also for taking the time to explain it. :)

Just for clarification: I am already using master, so #312 is already in my local sources. Yet if I query with [['img.wp-post-image', 'src']] I still get the concatenation. Could it be that #312 only changes the URLs after I already extracted the lead_image_url, or should I query differently?

toufic-m commented 5 years ago

That's awesome, no problem! When you're using master locally, you need to create your own local build (by following the documented steps), so that the changes that got merged into master (post-v2.0.0-release) are compiled into the distributable files. Could that be the issue for you now?

black-puppydog commented 5 years ago

You mean these steps? https://github.com/postlight/mercury-parser/blob/master/CONTRIBUTING.md#building

Yes, that's what I've been doing. JS really is a different world. I figured if I am executing code that I've written while master is checked out, then I must be running master code overall...? :P

black-puppydog commented 5 years ago

sorry, forgot to close this. thanks again!