MartinMalinda closed 7 months ago
Fix is merged. New version 4.0.0 available: https://www.npmjs.com/package/url-metadata
Thanks!
Hi @laurengarcia
I'm using v4.0.1 and testing on this url: https://www.youku.tv/v_nextstage/id_decaa6fcde074a59aa21.html
There are 2 jsonld objects but I'm only able to get the first one. Could you please check?
@anhthang
Google Rich Results Test also shows just one item, so it's probably a syntax issue on their side
https://search.google.com/test/rich-results/result/r%2Fvideos?id=4R3t7Nw0mGhZhRFIvi_6Vw
Thanks for sharing this. Here's what i see on the schema.org validator, looks like our code is not accounting for the 2nd case where the script tag contains two items. Can fix it this weekend.
Ok, i see what's happening. There is a syntax error in the jsonld for the youku.tv example. You can verify it here: https://json-ld.org/playground/
In our code here, you can uncomment the console.log in the try/catch(e) i added in /lib/extract-json-ld.js and you'll get the error "Bad control character in string literal in JSON at position 1737 (line 12 column 1222)". The standard formatter I'm using in this package also points to several spots with "irregular whitespace" in the json-ld (use npm run format in this package to test on the original markup), so i removed those as well.
That said, our code should be able to handle the @graph syntax for jsonld, so i added support for it and added a test that includes this example without the syntax error and irregular whitespaces in the original markup (see /test/json-ld.test.js).
Newest version 4.1.0 is published: https://www.npmjs.com/package/url-metadata
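For anyone following along, here is a minimal sketch of the idea described above: parse each JSON-LD script body inside a try/catch, and expand the @graph form into individual items. This is not the code in /lib/extract-json-ld.js; the helper name collectJsonLd and the sample strings are hypothetical, and the exact error message depends on the Node version.

```javascript
// Hypothetical helper for illustration only; not the library's implementation.
// Takes the raw text of each <script type="application/ld+json"> tag.
function collectJsonLd(scriptBodies) {
  const items = [];
  const errors = [];
  for (const body of scriptBodies) {
    try {
      const parsed = JSON.parse(body);
      // A script tag may hold one object, an array of objects,
      // or a single object that wraps several nodes in "@graph".
      const nodes = Array.isArray(parsed)
        ? parsed
        : Array.isArray(parsed['@graph'])
          ? parsed['@graph']
          : [parsed];
      items.push(...nodes);
    } catch (e) {
      // Invalid JSON is reported, not silently repaired.
      errors.push(e.message);
    }
  }
  return { items, errors };
}

// Valid markup using the @graph form with two nodes.
const good =
  '{"@context":"https://schema.org","@graph":[{"@type":"VideoObject"},{"@type":"BreadcrumbList"}]}';
// The \n here becomes a raw line break inside a JSON string literal,
// which JSON.parse rejects as a bad control character.
const bad = '{"@type":"VideoObject","description":"line one\nline two"}';

const result = collectJsonLd([good, bad]);
console.log(result.items.length); // 2
console.log(result.errors.length); // 1
```

The second script body is skipped entirely because of the single raw line break, which is the same kind of failure as in the youku.tv markup.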
Hi @laurengarcia
I don't think the whitespace is the issue, since the other metadata is still parsed.
I copied the missing jsonld into VS Code and got an error at the end of the description part. After removing the line break, there were no more warnings. FYI, hope this can help.
Yes, to be clear that's exactly what i saw. I could make that example url's json-ld parse properly if the one "unexpected end of string" was removed. The additional "irregular whitespaces" i removed as well because my formatter didn't like them, but the json-ld parsed just fine when they were there.
Is there any way to handle those unexpected errors so we don't miss this data?
Been mulling this over. Not sure what the right thing to do is here. I think we should stick to what we have so that it matches the Google tool. If we do the work of trimming the bad space(s) or bad characters out, then we mislead the person scraping the data into thinking the markup works when in fact it does not. I'm open to counterarguments tho, i just think it's deceptive at this time to strip the bad spaces/characters out before returning the data.
If you target some specific site and you want to make it work despite syntax errors on their side, you can fetch the HTML and patch it before extracting the meta. Pass the patched HTML to url-metadata instead.
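A rough sketch of that workaround, in case it helps. The helper patchJsonLdScripts is hypothetical, and it changes the data slightly: every raw control character inside a JSON-LD script body (such as a stray line break in a description) becomes a single space. The commented usage at the end assumes Node 18+ (global fetch and Response) and that your url-metadata version supports the parseResponseObject option described in its README; check the README for your version before relying on it.

```javascript
// Hypothetical helper: rewrite only the bodies of JSON-LD script tags,
// replacing runs of raw control characters (U+0000 to U+001F) with a space.
// Between JSON tokens this is harmless whitespace; inside a string literal
// it turns an invalid raw line break into a valid space.
function patchJsonLdScripts(html) {
  const scriptRe =
    /(<script[^>]*type=["']application\/ld\+json["'][^>]*>)([\s\S]*?)(<\/script>)/gi;
  return html.replace(scriptRe, (match, open, body, close) => {
    const cleaned = body.replace(/[\u0000-\u001f]+/g, ' ');
    return open + cleaned + close;
  });
}

// Sample markup with a raw line break inside the description string.
const html =
  '<script type="application/ld+json">{"description":"line one\nline two"}</script>';
const patched = patchJsonLdScripts(html);
const body = patched.slice(patched.indexOf('>') + 1, patched.lastIndexOf('<'));
console.log(JSON.parse(body).description); // "line one line two"

// Usage sketch (assumptions as noted above):
//
// const urlMetadata = require('url-metadata');
// const res = await fetch(url);
// const fixedHtml = patchJsonLdScripts(await res.text());
// const metadata = await urlMetadata(null, {
//   parseResponseObject: new Response(fixedHtml, {
//     headers: { 'Content-Type': 'text/html' }
//   })
// });
```

Doing the patch on your side keeps the library's behavior in line with the Google tool, while you decide per site how much repair you are comfortable with.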
I'm looking at https://github.com/laurengarcia/url-metadata/blob/master/lib/extract-json-ld.js
Unless I'm reading it wrong, it seems like extracted is being replaced with the last parsed JSON-LD, instead of returning an array of all JSON-LD items?
I'm testing on this URL: https://goout.net/cs/metronome-prague-2024/szpsfuw/
The Rich Results Test detects two items: https://search.google.com/test/rich-results/result?id=6sn_Xcdp6zfbqC3lP9ywpg
And I'm getting back only one: