Closed: Psycojoker closed this issue 10 years ago
It appears that it happens way more often in 500 MB of data full of Unicode names. You can find a (dirty) solution that works for UTF-8 characters of arbitrary length here: https://github.com/Psycojoker/ijson/commit/726f56d8199780b9f73264b2cdef5c1df32171e5
I didn't submit a pull request because I consider this solution a dirty hack, and I think this should be fixed in a better way somewhere else if possible.
Oh, that's embarrassing :-(. Indeed, it is a problem, and a very real one for, say, JSON with a lot of text in Asian languages that translates to 3- or 4-byte sequences in UTF-8. I can't think of a good solution off the top of my head, but I have an idea that might just work: all the syntactically significant characters are ASCII anyway, so we can parse without decoding at all and decode only fully parsed strings. I shall look into it…
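Roughly, the idea looks like this (just an illustrative sketch, not actual ijson code; backslash escapes are ignored for brevity): because no byte of a multi-byte UTF-8 sequence is ever below 0x80, the closing quote (0x22) of a JSON string can be located in the raw bytes, and only the completed string needs decoding.

def extract_string(buf, start):
    # buf is raw bytes, start points just past the opening quote.
    # 0x22 ('"') can never appear inside a multi-byte UTF-8 sequence,
    # so searching the undecoded bytes is safe; only the finished
    # string gets decoded.  (Backslash escapes are ignored here.)
    end = buf.index(b'"', start)
    return buf[start:end].decode('utf-8'), end + 1

buf = '"日本語": 1'.encode('utf-8')
value, pos = extract_string(buf, 1)   # ('日本語', 11)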
I too encounter this error very often when dealing with json data containing Japanese, Hindi and Russian characters.
@isagalaev as horrible as it is, maybe you could merge my hack in the meantime to spare your users this bug :/ ? I can open a pull request if you'd like.
Yeah, that's probably a good temporary solution. Could you make it into a pull request?
@isagalaev done https://github.com/isagalaev/ijson/pull/17
This solved my problems as well. Thanks!
Sorry, I was a bit premature :/
I still encounter errors at the other self.f.read(BUFSIZE).decode('utf-8') lines when dealing with characters from various non-Latin alphabets.
I added this hack to solve it and it seems to work for me. Tested it with lots of different characters.
def _read_(self, size):
    # Keep two read buffers: return the current one, but first pull any
    # UTF-8 continuation bytes (10xxxxxx, i.e. byte & 0xC0 == 0x80) off
    # the front of the next buffer, so a multi-byte character is never
    # split across two reads.
    if self._buffer is None:
        self._buffer = self.f.read(size)
    ret = self._buffer
    self._buffer = self.f.read(size)
    while len(self._buffer) and ord(self._buffer[:1]) & 0xC0 == 0x80:
        ret += self._buffer[:1]
        self._buffer = self._buffer[1:]
    return ret
And then replaced all self.f.read(BUFSIZE).decode('utf-8') with self._read_(BUFSIZE).decode('utf-8'), and added self._buffer = None in __iter__.
It uses two read buffers and moves any unfinished (continuation) bytes from the start of the second buffer onto the end of the first. It should work with UTF-8 characters of any length.
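To sanity-check it outside of ijson, the same logic can be exercised against an io.BytesIO full of 3-byte characters (a self-contained illustration only; the Reader class below is a stand-in, not part of ijson):

import io

class Reader(object):
    # Minimal stand-in for the parser object, only to exercise _read_.
    def __init__(self, data):
        self.f = io.BytesIO(data)
        self._buffer = None

    def _read_(self, size):
        # Same logic as above.
        if self._buffer is None:
            self._buffer = self.f.read(size)
        ret = self._buffer
        self._buffer = self.f.read(size)
        while len(self._buffer) and ord(self._buffer[:1]) & 0xC0 == 0x80:
            ret += self._buffer[:1]
            self._buffer = self._buffer[1:]
        return ret

data = ('語' * 10).encode('utf-8')    # 30 bytes, every character is 3 bytes
reader = Reader(data)
decoded = ''
while True:
    chunk = reader._read_(4)          # a plain f.read(4) would split characters
    if not chunk:
        break
    decoded += chunk.decode('utf-8')  # never raises UnicodeDecodeError
assert decoded == '語' * 10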
Best regards
I've finally managed to get my hands on this issue, sorry for the delay. It turns out Python's codecs module already has a correct streaming UTF-8 decoder, which fixes this in an easy and fast way.
Also, credit where it's due: thanks to Daniel Lescohier for making me aware of codecs.StreamReader.
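For anyone curious, here is a small sketch of the mechanism using codecs.getreader (purely illustrative, not necessarily the exact code that landed in ijson):

import codecs
import io

raw = io.BytesIO(('語' * 10).encode('utf-8'))   # 30 bytes of 3-byte characters
reader = codecs.getreader('utf-8')(raw)

# The size argument is only a hint for how many bytes to pull from the
# underlying stream; incomplete trailing bytes stay in the reader's
# internal buffer, so a multi-byte character is never split.
text = ''
while True:
    chunk = reader.read(4)
    if not chunk:
        break
    text += chunk          # whole characters only, no UnicodeDecodeError
assert text == '語' * 10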
I'm going to release a new version of ijson with this fix, so if anyone reading this could test it on their data that would be very handy! Thanks!
The updated version works on my data. Thanks a lot for the fix.
Hello,
I've encountered a very rare bug in version 1.1 of ijson. I tried to create some code to reproduce it, but I only managed to make it work (more or less) on a Debian stable, so I'm going to describe it instead.
Here: https://github.com/isagalaev/ijson/blob/master/ijson/backends/python.py#L88 you do a .decode("utf-8") on a partial read of a file, which works in 99.9% of cases. But UTF-8 characters can be stored on more than one byte, so in the remaining 0.1% of cases the read stops in the middle of a UTF-8 character; the buffer then doesn't contain the full character, and .decode("utf-8") fails with a UnicodeDecodeError.
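A minimal illustration of that failure mode (just a sketch; any multi-byte character triggers it):

data = '語'.encode('utf-8')       # one 3-byte character: b'\xe8\xaa\x9e'
data.decode('utf-8')              # fine
try:
    data[:2].decode('utf-8')      # the read stopped mid-character
except UnicodeDecodeError as e:
    print(e)                      # "... unexpected end of data"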
My current dirty hack to fix this is:
But this is not a good solution, because UTF-8 characters can have an arbitrary length AFAIK. My intuition is that the right approach would be to only decode the content when you really need it, but I didn't manage to code this (tbh, I only tried for about 10 minutes).
Here is some code to reproduce this bug (but I've only managed to make it work on a stable Debian where, apparently, the encoding is a bit messed up): https://gist.github.com/Psycojoker/7225794 (you obviously need ijson 1.1 to make it work).
Kind regards and thanks a lot for ijson, it really helped me a lot :)