scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

JSONDecodeError: Extra data: line 21 column 1 (char 572) for URL https://lubelska.co.uk/ #143

Open advance512 opened 4 years ago

advance512 commented 4 years ago

The issue seems to be that the JSON-LD document on the page is wrapped in CDATA comment lines:

// <![CDATA[
{
  "@context": "http:\/\/schema.org\/",
  "name": "Lubelska",
  "@type": "Organization",
  "logo": "https://lubelska.co.uk/wp/wp-content/uploads/2019/05/Lubelska-1.jpg",
  "url": "https://lubelska.co.uk/",
  "sameAs": [
    "https://twitter.com/EdwardHowey",
    "https://www.facebook.com/Lubelska-309144763268698/",
    "https://www.pinterest.co.uk/lubelskaltd/",
    "https://www.instagram.com/lubelska1/"
  ],
  "contactPoint": [{
    "@type": "ContactPoint",
    "telephone": "+44 20 3911 5526",
    "email": "info@lubelska.co.uk",
    "contactType": "sales"
  }]
}
// ]]&gt;

After the comment-stripping substitution in jsonLd._extractItems():

            # sometimes JSON-decoding errors are due to leading HTML or JavaScript comments
            data = json.loads(
                HTML_OR_JS_COMMENTLINE.sub('', script), strict=False)

it becomes:

{
  "@context": "http:\/\/schema.org\/",
  "name": "Lubelska",
  "@type": "Organization",
  "logo": "https://lubelska.co.uk/wp/wp-content/uploads/2019/05/Lubelska-1.jpg",
  "url": "https://lubelska.co.uk/",
  "sameAs": [
    "https://twitter.com/EdwardHowey",
    "https://www.facebook.com/Lubelska-309144763268698/",
    "https://www.pinterest.co.uk/lubelskaltd/",
    "https://www.instagram.com/lubelska1/"
  ],
  "contactPoint": [{
    "@type": "ContactPoint",
    "telephone": "+44 20 3911 5526",
    "email": "info@lubelska.co.uk",
    "contactType": "sales"
  }]
}
// ]]&gt;

and naturally this trailing line, which the substitution did not remove:

// ]]&gt;

causes the error.
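
To make the failure concrete, here is a minimal sketch that reproduces it on a shortened version of the document above. Both regular expressions are assumptions written for illustration: LEADING_COMMENT stands in for a pattern that only strips a comment at the start of the string (it may not match extruct's actual HTML_OR_JS_COMMENTLINE), and ANY_COMMENT_LINE is one possible variant that also strips trailing comment lines.

```python
import json
import re

# Hypothetical stand-in: strips a comment only at the very start of the string.
LEADING_COMMENT = re.compile(r'^\s*(//.*|<!--.*-->)')

# Assumed alternative: with re.MULTILINE, ^ matches at the start of every
# line, so a trailing "// ]]&gt;" line is stripped as well.
ANY_COMMENT_LINE = re.compile(r'^\s*(//.*|<!--.*-->)', re.MULTILINE)

# Shortened version of the script body from the page.
script = (
    '// <![CDATA[\n'
    '{"@type": "Organization", "name": "Lubelska"}\n'
    '// ]]&gt;\n'
)


def parse(pattern, text):
    """Strip comments with the given pattern, then decode the JSON."""
    return json.loads(pattern.sub('', text), strict=False)


try:
    parse(LEADING_COMMENT, script)
    leading_ok = True
except json.JSONDecodeError as exc:
    # The leftover "// ]]&gt;" line follows the closing brace.
    leading_ok = False
    error_message = exc.msg

# Stripping comment lines anywhere in the text lets the decode succeed.
data = parse(ANY_COMMENT_LINE, script)
```

The multiline variant only removes lines whose first non-whitespace characters are // or <!--, so URLs such as https://... inside JSON string values are left alone.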

Vitiell0 commented 4 years ago

I'm having the same problem with this URL: https://www.eatwell101.com/shrimp-and-broccoli-foil-packs-recipe

This is the value of script after applying HTML_OR_JS_COMMENTLINE.sub():

'\n{
"@context":"https:\\/\\/schema.org\\/",
"@type":"Recipe",
"mainEntityOfPage":{
"@type":"WebPage",
"@id":"https:\\/\\/www.eatwell101.com\\/shrimp-and-broccoli-foil-packs-recipe"},
"name":"Baked Shrimp and Broccoli Foil Packs with Garlic Lemon Butter Sauce",
"url":"https:\\/\\/www.eatwell101.com\\/shrimp-and-broccoli-foil-packs-recipe",
"headline":"Baked Shrimp and Broccoli Foil Packs with Garlic Lemon Butter Sauce",
"Description":"This baked shrimp foil pack meal is ready in under 30 minutes - The easiest way to cook shrimp in your oven!",
"author":{
"@type":"Person",
"name":"Christina Cherrier"},
"image":"https:\\/\\/www.eatwell101.com\\/wp-content\\/uploads\\/2019\\/04\\/shrimp-and-broccoli-recipe-2.jpg",
"datePublished":"2020-01-10 07:47:21",
"dateModified":"2020-06-20 17:47:39",
"Publisher":"Eatwell101",
"ingredients":"",
"prepTime":"PT10M",
"cookTime":"PT15M",
"recipeYield":"2 servings"}
// ]]>\n'

So it is the same problem: the trailing // ]]> line was not removed.
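
As a possible workaround on the caller's side (a sketch only, and not necessarily what any fix in extruct does), the standard library's json.JSONDecoder.raw_decode decodes the first JSON value in a string and reports where it ended, so trailing text such as // ]]> is simply ignored. The helper name decode_first_json_value and the shortened script string are made up for this example.

```python
import json


def decode_first_json_value(text):
    """Decode the first JSON value in text and ignore anything after it.

    raw_decode does not skip leading whitespace, so strip it first.
    """
    decoder = json.JSONDecoder(strict=False)
    value, _end = decoder.raw_decode(text.lstrip())
    return value


# Shortened version of the script value shown above.
script = '\n{"@type":"Recipe","recipeYield":"2 servings"}\n// ]]>\n'
recipe = decode_first_json_value(script)
```

The trade-off is that this silently discards any trailing content, including a second JSON document, so it is more permissive than stripping comment lines explicitly.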

Vitiell0 commented 4 years ago

Just opened a PR with a fix here: https://github.com/scrapinghub/extruct/pull/144