scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

Handle badly formatted JSON-LD data. #87

Open shiquanwang opened 6 years ago

shiquanwang commented 6 years ago

Some web pages contain badly formatted JSON-LD data. For example, the JSON-LD in one such page is:


{
  "@context": "http://schema.org",
        "@type": "Product",
                "name": "Black 'Clint' FT0511 cat eye sunglasses",
                "image": "https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001",
        "brand": {
                  "@type": "Thing",
                  "name": "Tom Ford"
                },
                "offers": {
                    "@type": "Offer",
                    "priceCurrency": "GBP",
                    "price": "285.00",
                    "itemCondition": "http://schema.org/NewCondition",
                    "availability": "http://schema.org/InStock"
                }
    }
}

In the JSON-LD above, the last } is extra, and neither extruct nor json.loads handles it properly.

In Python 3.5 and later, json.loads reports detailed error information through JSONDecodeError, here Extra data: line 19 column 1 (char 624):

In [7]: try:
   ...:     data = json.loads(json_ld_string)
   ...: except json.JSONDecodeError as err:
   ...:     print(err)
   ...:     print(err.msg)
   ...:     print(err.pos)
   ...:
Extra data: line 19 column 1 (char 624)
Extra data
624

err.msg and err.pos give hints for repairing the JSON-LD data. In this case, we can remove the character at position 624 and parse the string again, which correctly yields:

{'@context': 'http://schema.org',
 '@type': 'Product',
 'brand': {'@type': 'Thing', 'name': 'Tom Ford'},
 'image': 'https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001',
 'name': "Black 'Clint' FT0511 cat eye sunglasses",
 'offers': {'@type': 'Offer',
            'availability': 'http://schema.org/InStock',
            'itemCondition': 'http://schema.org/NewCondition',
            'price': '285.00',
            'priceCurrency': 'GBP'}}
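
The retry described above can be sketched roughly as follows. This is only a minimal illustration, assuming that the sole defect is stray trailing characters; loads_dropping_extra_chars and max_repairs are hypothetical names, not part of extruct.

```python
import json


def loads_dropping_extra_chars(text, max_repairs=5):
    """Parse JSON, deleting the offending character on 'Extra data' errors.

    Hypothetical helper, not part of extruct. Only the 'Extra data' case
    is repaired; every other decode error is re-raised unchanged.
    """
    for _ in range(max_repairs):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if err.msg != "Extra data":
                raise
            # err.pos is the index of the first non-whitespace character
            # after the complete JSON value; drop it and try again.
            text = text[:err.pos] + text[err.pos + 1:]
    # Repairs exhausted: one last attempt, which raises if still invalid.
    return json.loads(text)
```

Capping the number of repairs keeps a badly broken document from being whittled away one character at a time.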

There are many possible formatting errors: some can be fixed easily, while others are harder or even impossible to fix.

I propose 3 ways to improve the situation:

I personally recommend the latter 2 ways.

Thanks.

kmike commented 6 years ago

I guess this provides more motivation for https://github.com/scrapinghub/extruct/pull/69/, though I'd prefer the JSON decoding function to be an argument, not a global option.

Providing something that handles more cases by default makes sense to me, though we could start with just a good example in the README.
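
To illustrate what "decoding function as an argument" could look like, here is a rough sketch. parse_json_ld_blocks and its loads parameter are hypothetical and are not extruct's actual API; the point is only that the caller chooses the decoder per call instead of through global state.

```python
import json


def parse_json_ld_blocks(blocks, loads=json.loads):
    """Decode each JSON-LD script body with a caller-supplied function.

    Hypothetical function for illustration; this is not extruct's API.
    Blocks that fail to decode are skipped.
    """
    results = []
    for block in blocks:
        try:
            results.append(loads(block))
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            continue
    return results
```

A caller who needs lenient parsing would pass their own function as loads, while everyone else keeps the strict json.loads default.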

kmike commented 6 years ago

Maybe other libraries like demjson or yajl can handle it (see http://deron.meranda.us/python/demjson/demjson-2.2.4/docs/demjson.html#-decode - it seems there is an option to return data after the error).
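
As a point of comparison that needs no third-party library, the standard library's json.JSONDecoder.raw_decode also returns the data parsed before any trailing garbage. The sketch below uses only the stdlib and does not show demjson or yajl; loads_ignoring_trailing_data is a hypothetical name.

```python
import json


def loads_ignoring_trailing_data(text):
    """Return the first JSON value in text and whatever follows it.

    Hypothetical helper built on the stdlib; not demjson and not extruct.
    """
    # raw_decode does not skip leading whitespace, so strip it first.
    stripped = text.lstrip()
    value, end = json.JSONDecoder().raw_decode(stripped)
    return value, stripped[end:]
```

Returning the leftover text lets the caller decide whether a stray } is harmless or a sign of a more serious problem.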

gaurav19063 commented 4 years ago

Updated JSON-LD can autocorrect badly formatted JSON.