microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License

[metascraper-description] Wrong description pulled from nbcnews.com articles #341

Closed: jonpincus closed this issue 3 years ago

jonpincus commented 3 years ago

Subject of the issue

metascraper-description picks up the description field of a JSON-LD (ld+json) fragment rather than the page's meta description or og:description.

Steps to reproduce

const description = require("metascraper-description");
const metascraper = require("metascraper")([description()]);
const fetch = require("node-fetch"); // CommonJS require needs node-fetch v2; v3 is ESM-only
const url = "https://www.nbcnews.com/news/us-news/no-charges-filed-against-kenosha-officers-jacob-blake-shooting-n1252739";

// Fetch the page, run it through metascraper, and print the extracted description.
fetch(url)
  .then(res => res.text())
  .then(html => metascraper({ url, html }))
  .then(result => console.log(result.description))
  .catch(err => console.error(err));

Expected behaviour

Should print either the meta description content:

Rusten Sheskey, the police officer who shot Jacob Blake in the back on Aug. 23, will not face any criminal charges, Wisconsin prosecutors said Tuesday.

or the meta og:description content:

Even before announcing his findings, the Kenosha County district attorney pleaded for peace.

Actual behaviour

Actually prints:

This is additional taxonomy that helps us with analytics

which appears to come from this JSON-LD fragment:

{"@context":{"@vocab":"http://schema.org","pageType":{"@id":"Text","@type":"@id"},"vertical":{"@id":"Text","@type":"@id"},"subVertical":{"@id":"Text","@type":"@id"},"section":{"@id":"Text","@type":"@id"},"subSection":{"@id":"Text","@type":"@id"},"label":{"@id":"Text","@type":"@id"},"packageId":{"@id":"Text","@type":"@id"},"sponsor":{"@id":"Text","@type":"@id"},"ecommerceEnabled":{"@id":"Text","@type":"@id"},"videoPlayerCount":{"@id":"Text","@type":"@id"},"appVersion":{"@id":"Text","@type":"@id"},"tags":{"@id":"Text","@type":"@id"}},"@type":"Dataset","name":"additionalTaxonomy","description":"This is additional taxonomy that helps us with analytics","url":"https://www.nbcnews.com/news/us-news/no-charges-filed-against-kenosha-officers-jacob-blake-shooting-n1252739","pageType":"Article","vertical":"news","subVertical":"","section":"news","subSection":"us-news","label":"","packageId":"","sponsor":"","ecommerceEnabled":false,"videoPlayerCount":3,"appVersion":"5.17.0","tags":""}
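To illustrate how a fragment like this can shadow the article's own description, here is a minimal sketch. It is not metascraper's actual implementation: it assumes a naive extractor that returns the first description found across a page's JSON-LD blocks, next to a hypothetical variant that skips blocks whose @type is Dataset. The function names are made up, the first sample block is abbreviated from the fragment above, and the second (NewsArticle) block is invented for illustration and may not exist on the real page.

```javascript
// Hypothetical helpers for illustration; not part of metascraper.

// Naive approach: return the first non-empty description across all JSON-LD blocks.
function firstDescription (blocks) {
  for (const block of blocks) {
    if (typeof block.description === 'string' && block.description.trim() !== '') {
      return block.description
    }
  }
  return undefined
}

// Variant: ignore blocks whose @type is in a skip list (assumed here: Dataset).
// @type may be a string or an array in JSON-LD, so normalize it to an array first.
function firstDescriptionSkipping (blocks, skipTypes = ['Dataset']) {
  const kept = blocks.filter(block => {
    const types = [].concat(block['@type'] || [])
    return !types.some(type => skipTypes.includes(type))
  })
  return firstDescription(kept)
}

// Sample input: the Dataset block is abbreviated from the issue,
// the NewsArticle block is an invented stand-in for article metadata.
const blocks = [
  {
    '@type': 'Dataset',
    name: 'additionalTaxonomy',
    description: 'This is additional taxonomy that helps us with analytics'
  },
  {
    '@type': 'NewsArticle',
    description: 'Even before announcing his findings, the Kenosha County district attorney pleaded for peace.'
  }
]

console.log(firstDescription(blocks))         // the analytics text wins
console.log(firstDescriptionSkipping(blocks)) // the article text wins
```

Filtering by @type is only one possible design; it would not help on a site that puts an unrelated description on a block of some other type.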
jonpincus commented 3 years ago

As a workaround, I'm currently trying to move toDescription($jsonld('description')) after toDescription($ => $('meta[property="og:description"]').attr('content')), but I'm not sure whether this will break it on other sites!
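A similar effect might be achievable without patching metascraper-description, by supplying a small custom rules bundle. The sketch below is an assumption-laden illustration, not the fix adopted upstream: the name ogDescriptionFirst is hypothetical, and it relies on metascraper trying the rules for a property in order and keeping the first truthy value, with bundles listed earlier taking priority.

```javascript
// Hypothetical custom rules bundle; the name is made up for this sketch.
// Assumption: metascraper tries the rules for a property in order and keeps
// the first truthy value, so these run before any JSON-LD based rule.
const ogDescriptionFirst = () => ({
  description: [
    // Each rule receives the parsed page as `htmlDom`, a cheerio-style selector function.
    ({ htmlDom: $ }) => $('meta[property="og:description"]').attr('content'),
    // Fall back to the plain meta description when og:description is absent.
    ({ htmlDom: $ }) => $('meta[name="description"]').attr('content')
  ]
})
```

The bundle could then be listed ahead of the stock one, for example require("metascraper")([ogDescriptionFirst(), description()]), so that the JSON-LD rule is only reached when both meta tags are missing. Whether this ordering regresses other sites is untested.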

Kikobeats commented 3 years ago

I think it's strongly related to https://github.com/microlinkhq/metascraper/pull/340

jonpincus commented 3 years ago

> I think it's strongly related to #340

Seems plausible to me!