laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.
https://www.npmjs.com/package/url-metadata
MIT License

Multiple JSON LD objects? #83

Closed MartinMalinda closed 7 months ago

MartinMalinda commented 7 months ago

I'm looking at https://github.com/laurengarcia/url-metadata/blob/master/lib/extract-json-ld.js

Unless I'm reading it wrong, it seems like `extracted` is being overwritten with the last parsed JSON-LD object, instead of returning an array of all the JSON-LD objects?

I'm testing on this URL: https://goout.net/cs/metronome-prague-2024/szpsfuw/

rich results tester detects two items: https://search.google.com/test/rich-results/result?id=6sn_Xcdp6zfbqC3lP9ywpg

And I'm getting back only one:

{
"jsonLd": 
  {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "item": {
          "@id": "https://goout.net/cs/",
          "name": "Domů"
        }
      }
    ]
  }
}
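Presumably the fix is to accumulate each parsed block into an array rather than overwrite a single variable. A minimal sketch of that pattern (not the package's actual code; the function name and regex are illustrative):

```javascript
// Collect ALL <script type="application/ld+json"> blocks into an array,
// instead of letting each match overwrite the previous one.
function extractJsonLdAll(html) {
  const matches = html.matchAll(
    /<script type="application\/ld\+json">([\s\S]*?)<\/script>/g
  )
  const results = []
  for (const m of matches) {
    try {
      results.push(JSON.parse(m[1])) // accumulate, don't overwrite
    } catch (e) {
      // skip malformed JSON-LD blocks
    }
  }
  return results
}

const html = `
<script type="application/ld+json">{"@type":"BreadcrumbList"}</script>
<script type="application/ld+json">{"@type":"MusicEvent"}</script>`
console.log(extractJsonLdAll(html).length) // 2
```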
laurengarcia commented 7 months ago

Fix is merged. New version 4.0.0 available: https://www.npmjs.com/package/url-metadata

Thanks!

anhthang commented 7 months ago

Hi @laurengarcia

I'm using v4.0.1 and testing on this url: https://www.youku.tv/v_nextstage/id_decaa6fcde074a59aa21.html

There are 2 JSON-LD objects but I'm only able to get the first one. Could you please check?

MartinMalinda commented 7 months ago

@anhthang

Google Rich Result Test also shows just one item, so it's probably a syntax issue on their side.

https://search.google.com/test/rich-results/result/r%2Fvideos?id=4R3t7Nw0mGhZhRFIvi_6Vw

laurengarcia commented 7 months ago

Thanks for sharing this. Here's what I see on the schema.org validator: it looks like our code is not accounting for the 2nd case, where the script tag contains two items. I can fix it this weekend.

laurengarcia commented 7 months ago

Ok, I see what's happening. There is a syntax error in the JSON-LD for the youku.tv example. You can verify it here: https://json-ld.org/playground/

In our code, you can uncomment the console.log in the try/catch I added in /lib/extract-json-ld.js and you'll get the error "Bad control character in string literal in JSON at position 1737 (line 12 column 1222)". The standard formatter used in this package also flags several spots with "irregular whitespace" in the JSON-LD (run npm run format in this package to test against the original markup), so I removed those as well.
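That error class is easy to reproduce: JSON.parse rejects raw (unescaped) control characters such as literal newlines inside string literals, which is exactly what a line break inside a description field produces. A quick illustration:

```javascript
// A raw newline inside a JSON string literal is invalid JSON and makes
// JSON.parse throw a SyntaxError ("Bad control character in string
// literal..."). The escaped form "\\n" parses fine.
const bad = '{"description": "line one\nline two"}'   // raw newline
const good = '{"description": "line one\\nline two"}' // escaped newline

try {
  JSON.parse(bad)
} catch (e) {
  console.log(e instanceof SyntaxError) // true
}
console.log(JSON.parse(good).description.includes('\n')) // true
```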

That said, our code should be able to handle the @graph syntax for JSON-LD, so I added support for it, along with a test that includes this example with the syntax error and irregular whitespace removed from the original markup (see /test/json-ld.test.js).
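JSON-LD can group multiple entities under a single top-level @graph key. A minimal sketch of unwrapping that container so each node is surfaced individually (illustrative only, not the package's actual implementation):

```javascript
// If a JSON-LD object uses the @graph container, return its nodes
// individually; otherwise wrap the single object in an array.
function unwrapJsonLd(parsed) {
  if (parsed && Array.isArray(parsed['@graph'])) {
    return parsed['@graph']
  }
  return [parsed]
}

const doc = {
  '@context': 'https://schema.org',
  '@graph': [{ '@type': 'VideoObject' }, { '@type': 'BreadcrumbList' }]
}
console.log(unwrapJsonLd(doc).length) // 2
```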

Newest version 4.1.0 is published: https://www.npmjs.com/package/url-metadata

anhthang commented 7 months ago

Hi @laurengarcia

I think it's not an issue with the whitespace, as the other metadata is still parsed.

I tried copying the missing JSON-LD into VS Code and got the error below at the end of the description part. After removing the line break, there were no more warnings. FYI, hope this helps.

laurengarcia commented 7 months ago

Yes, to be clear, that's exactly what I saw. I could make that example URL's JSON-LD parse properly once the one "unexpected end of string" was removed. I also removed the additional "irregular whitespaces" because my formatter didn't like them, but the JSON-LD parsed just fine with them in place.

anhthang commented 7 months ago

> Yes, to be clear that's exactly what i saw. I could make that example url's json-ld parse properly if the one "unexpected end of string" was removed. The additional "irregular whitespaces" i removed as well because my formatter didn't like them, but the json-ld parsed just fine when they were there.

Is there any way to fix those unexpected errors so we don't miss this data?

laurengarcia commented 7 months ago

Been mulling this over. Not sure what the right thing to do is here. I think we should stick with what we have so that it matches the Google tool. If we do the work of trimming out the bad spaces or bad characters, we're deceiving the person scraping the data into thinking the page's JSON-LD is valid when in fact it is not. I'm open to counterarguments, though; I just think it's deceptive at this time to strip the bad spaces/characters out before returning the data.

MartinMalinda commented 7 months ago

If you're targeting a specific site and want it to work despite syntax errors on their side, you can fetch the HTML yourself and patch it before extracting the metadata, then pass the patched HTML to url-metadata instead.
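A rough sketch of that approach. The patching step here (stripping raw control characters inside JSON-LD script tags) is one hypothetical repair, and the `parseResponseObject` option is documented in recent url-metadata versions; check the README for the version you're using before relying on it:

```javascript
// Hypothetical patch step: strip raw control characters inside
// <script type="application/ld+json"> blocks so JSON.parse succeeds.
function patchJsonLd(html) {
  return html.replace(
    /(<script type="application\/ld\+json">)([\s\S]*?)(<\/script>)/g,
    (_, open, body, close) =>
      open + body.replace(/[\u0000-\u001f]+/g, ' ') + close
  )
}

async function getPatchedMetadata(url) {
  const urlMetadata = require('url-metadata') // npm install url-metadata
  const res = await fetch(url)
  const html = patchJsonLd(await res.text())
  // Hand url-metadata a Response built from the patched HTML instead of
  // letting it fetch the URL itself.
  return urlMetadata(null, {
    parseResponseObject: new Response(html, { headers: res.headers })
  })
}
```

Note the tradeoff discussed above: patching on your side keeps the library's behavior honest (matching the Google tool) while still letting you recover data from a site you know is broken.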