postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Parsing lead_image_url when there are multiple og:image's present #592

Closed TLadd closed 2 years ago

TLadd commented 3 years ago

Expected Behavior

If a page sets og:image twice, the parser should choose one of them as the lead_image_url. Specifying duplicate og:image tags is obviously a mistake, but I would still like the image to be parsed in this scenario.

Current Behavior

The parser chooses neither of the og:image values and falls back to another image on the page.

Steps to Reproduce

const MercuryParser = require("@postlight/mercury-parser");

// Any page with two `og:image` tags set
const url = "https://www.realityblurred.com/realitytv/2017/08/ayto-season-six-host-terrence-j/";
MercuryParser.parse(url).then((result) => console.log(result.lead_image_url));

This prints https://www.realityblurred.com/realitytv/wp-content/themes/realityblurred/images/Andy-Dehnart.jpg, which is the first image in the body of the page. The page does define an og:image, but the identical tag is specified twice in the head:

<meta property="og:image" content="https://www.realityblurred.com/realitytv/images/2017/08/ayto-season-six-cast.jpg">

Detailed Description

I'm trying to get the lead image URL out of the above page.

Possible Solution

If multiple og:image tags are present in a page, choose the first one.
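
To illustrate the proposal, here is a minimal sketch of a "first og:image wins" rule. It is not Mercury Parser's actual implementation and not necessarily how the eventual fix works: `firstOgImage` is a hypothetical helper, and it uses a simple regular expression instead of a real HTML parser so that it runs without dependencies.

```javascript
// Hypothetical helper (not part of Mercury Parser's API).
// Scans <meta> tags in document order and returns the content of the
// first one whose property is "og:image", or null if there is none.
// Assumption: attribute values are quoted and contain no quote characters.
function firstOgImage(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    const property = /\bproperty\s*=\s*["']([^"']*)["']/i.exec(tag);
    const content = /\bcontent\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (
      property &&
      property[1].toLowerCase() === "og:image" &&
      content &&
      content[1] !== ""
    ) {
      return content[1]; // first match wins; later duplicates are ignored
    }
  }
  return null;
}

// Usage: a head with the same og:image specified twice.
const head = `
  <meta property="og:image" content="https://www.realityblurred.com/realitytv/images/2017/08/ayto-season-six-cast.jpg">
  <meta property="og:image" content="https://www.realityblurred.com/realitytv/images/2017/08/ayto-season-six-cast.jpg">
`;
console.log(firstOgImage(head));
```

With this rule, duplicate tags no longer cause a fallback to a body image, whether the duplicates are identical (as on the page above) or different.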

johnholdun commented 2 years ago

Fixed by #696