Open shiquanwang opened 6 years ago
I guess this provides more motivation for https://github.com/scrapinghub/extruct/pull/69/, though I'd prefer the JSON decoding function to be an argument, not a global option.
Providing something which handles more cases by default makes sense to me, though we may start with just a good example in the README.
Maybe other libraries like demjson or yajl can handle it (see http://deron.meranda.us/python/demjson/demjson-2.2.4/docs/demjson.html#-decode - it seems there is an option to return data after the error).
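A limited form of "return data after the error" is also available in the standard library: `json.JSONDecoder.raw_decode` parses the first complete JSON value and returns the index where it stopped, so trailing text is reported rather than rejected. A minimal sketch, where the `decode_prefix` helper name is hypothetical and not part of extruct, demjson or yajl:

```python
import json


def decode_prefix(text):
    """Parse the first JSON value in `text` and return (value, leftover).

    Hypothetical helper for illustration only.
    raw_decode does not skip leading whitespace, so strip it first.
    """
    text = text.lstrip()
    value, end = json.JSONDecoder().raw_decode(text)
    return value, text[end:]


value, leftover = decode_prefix('  {"a": [1, 2]}\n}')
```

This only helps when the damage comes after a complete value; errors inside the value still raise `json.JSONDecodeError`.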
Updated JSON-LD can autocorrect badly formatted JSON.
Some web pages contain badly formatted JSON-LD data, e.g., an example
The JSON-LD in this page is:
In the JSON-LD above, the last `}` is extra, and neither `extruct` nor `json.loads` handles it properly.
In Python 3.5 and later, `json.loads` gives detailed error information, such as `JSONDecodeError: Extra data: line 19 column 1 (char 624)`.
The `error.msg` and `error.pos` attributes give some hints for fixing the JSON-LD data. In this case, we can remove the character at position 624 and parse the string again to get the correct data.
There are many possible format errors; some can be fixed easily, while others might be harder or even impossible.
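The repair described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `loads_dropping_extra` and the `max_fixes` limit are hypothetical, the sample string is made up (it is not the page's JSON-LD), and only the "Extra data" case is handled.

```python
import json


def loads_dropping_extra(text, max_fixes=3):
    """Parse JSON, deleting the offending character on "Extra data" errors.

    Hypothetical helper for illustration; any other error is re-raised.
    """
    for _ in range(max_fixes):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if err.msg != "Extra data":
                raise
            # err.pos is the index of the first unexpected character.
            text = text[:err.pos] + text[err.pos + 1:]
    # Final attempt; raises if the text is still invalid.
    return json.loads(text)


# Made-up sample with one extra closing brace at the end.
broken = '{"@type": "Person", "name": "Jane"}\n}'
data = loads_dropping_extra(broken)
```

Checking `err.msg` before editing keeps the fix narrow: errors inside the document (a missing value, an unquoted key) are left for the caller to deal with.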
I propose 3 ways to improve the situation:
1. `extruct` tries various ways to fix the JSON-LD data case by case; this requires Python >= 3.5 to get detailed error information.
2. `extruct` allows the user to pass in a function to parse JSON data, and lets users handle their own error types.
3. `extruct` outputs the extracted JSON-LD string instead of parsed data, and lets users parse it and handle their own error types.

I personally recommend the latter two ways.
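To make the last two options concrete, here is a rough sketch of what such an interface could look like. It is not extruct's API: `parse_jsonld_blocks` and its `loads` parameter are hypothetical names, and the raw strings stand in for the contents of JSON-LD script tags. The custom decoder shown uses `json.loads(..., strict=False)`, which tolerates raw control characters inside strings, as one example of a user-specific fix.

```python
import json
from functools import partial


def parse_jsonld_blocks(raw_blocks, loads=json.loads):
    """Parse each raw JSON-LD string with a caller-supplied decoder.

    Hypothetical helper, not part of extruct. Returns (parsed, failed),
    where `failed` holds the raw strings that could not be decoded so the
    caller can handle them separately.
    """
    parsed, failed = [], []
    for raw in raw_blocks:
        try:
            parsed.append(loads(raw))
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            failed.append(raw)
    return parsed, failed


# Second block has a literal newline inside a string value.
blocks = ['{"@type": "Person"}', '{"name": "line1\nline2"}']

strict_parsed, strict_failed = parse_jsonld_blocks(blocks)
lenient_parsed, lenient_failed = parse_jsonld_blocks(
    blocks, loads=partial(json.loads, strict=False)
)
```

Passing the decoder as an argument keeps the default behaviour unchanged, while returning the unparsed strings lets callers apply their own repairs afterwards.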
Thanks.