isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

Ijson support for unicode #69

Closed: latha19jan closed this issue 4 years ago

latha19jan commented 6 years ago

Parsing characters such as ö or ™ throws UnexpectedSymbol. Can somebody help with this issue? Does ijson support Unicode characters?

akaIDIOT commented 6 years ago

ijson does support Unicode. Could your issue have something to do with the encoding of what you're trying to parse?

UTF-8 bytes work:

>>> import ijson
>>> from io import BytesIO
>>> data = BytesIO('{"key": "vålue™"}'.encode('utf-8'))
>>> list(ijson.parse(data))
[('', 'start_map', None),
 ('', 'map_key', 'key'),
 ('key', 'string', 'vålue™'),
 ('', 'end_map', None)]

UTF-16LE bytes don't:

>>> data = BytesIO('{"key": "vålue™"}'.encode('utf-16le'))
>>> list(ijson.parse(data))
UnicodeDecodeError
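
If the input really is UTF-16LE, one possible workaround is to transcode it to UTF-8 on the fly before handing it to the parser. The following is only a sketch using the standard library: `Utf8Transcoder` is a hypothetical helper name, not part of ijson, and it assumes the consumer only needs a `read(size)` method that returns bytes and does not mind getting back more or fewer bytes than requested.

```python
import codecs
import io


class Utf8Transcoder:
    """Hypothetical file-like wrapper (not part of ijson).

    Reads bytes in `source_encoding` from `raw` and returns UTF-8 bytes.
    """

    def __init__(self, raw, source_encoding):
        self._raw = raw
        # The incremental decoder buffers partial characters between reads.
        self._decoder = codecs.getincrementaldecoder(source_encoding)()

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            # An empty chunk means EOF: flush the decoder so that truncated
            # input raises UnicodeDecodeError instead of being dropped.
            text = self._decoder.decode(chunk, final=not chunk)
            # Only return b'' at real EOF; a chunk holding half a character
            # decodes to '' and must not be mistaken for end of input.
            if text or not chunk:
                return text.encode("utf-8")


original = '{"key": "vålue™"}'
wrapped = Utf8Transcoder(io.BytesIO(original.encode("utf-16le")), "utf-16le")
utf8_bytes = wrapped.read()
assert utf8_bytes == original.encode("utf-8")
```

A wrapper like this could then be passed to `ijson.parse` in place of the raw UTF-16LE stream, in the same way as the UTF-8 `BytesIO` above.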

Strings and files (encoded as UTF-8) opened in text or binary mode work too:

>>> list(ijson.parse(open('/tmp/test.json')))
[('', 'start_map', None),
 ('', 'map_key', 'key'),
 ('key', 'string', 'vålue™'),
 ('', 'end_map', None)]

>>> list(ijson.parse(open('/tmp/bla.json', 'rt')))
[('', 'start_map', None),
 ('', 'map_key', 'key'),
 ('key', 'string', 'vålue™'),
 ('', 'end_map', None)]

>>> list(ijson.parse(open('/tmp/bla.json', 'rb')))
[('', 'start_map', None),
 ('', 'map_key', 'key'),
 ('key', 'string', 'vålue™'),
 ('', 'end_map', None)]
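
One caveat for text mode: `open()` without an `encoding` argument falls back to the platform's default encoding, which may not be UTF-8 on every system. A minimal sketch, assuming a UTF-8 encoded file (the file name `example.json` is made up for illustration), of the two ways to open it that do not depend on that default:

```python
import os
import tempfile

payload = '{"key": "vålue™"}'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(payload)

    # Text mode with an explicit encoding: independent of the platform default.
    with open(path, "rt", encoding="utf-8") as f:
        text_result = f.read()

    # Binary mode: raw UTF-8 bytes, no decoding done by open().
    with open(path, "rb") as f:
        bytes_result = f.read()

assert text_result == payload
assert bytes_result == payload.encode("utf-8")
```

Either file object could be passed to `ijson.parse` as in the examples above.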