isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

Bad cut on the buffer causes UnicodeDecodeError #15

Closed · Psycojoker closed this issue 10 years ago

Psycojoker commented 11 years ago

Hello,

I've encountered a very rare bug in version 1.1 of ijson. I tried to write some code to reproduce it, but I only managed to make it work (more or less) on a Debian stable, so I'm going to describe it instead.

Here: https://github.com/isagalaev/ijson/blob/master/ijson/backends/python.py#L88 you call .decode("utf-8") on a partial read of a file, which works in 99.9% of cases. But UTF-8 characters can be stored on more than one byte, so in the remaining 0.1% of cases the read stops in the middle of a UTF-8 character; the buffer then doesn't contain the full character and .decode("utf-8") fails with a message like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 16383: unexpected end of data
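
A minimal sketch of the failure mode (my own illustration, not the original gist): cutting the buffer inside a multi-byte character is enough to trigger exactly this error.

    # Minimal sketch: a buffer boundary inside a multi-byte UTF-8 sequence
    # makes .decode('utf-8') fail with "unexpected end of data".
    chunk = u'\xe9'.encode('utf-8')   # LATIN SMALL LETTER E WITH ACUTE: two bytes, '\xc3\xa9'
    try:
        chunk[:1].decode('utf-8')     # simulate a read that stops after the first byte
    except UnicodeDecodeError as e:
        print(e)                      # e.g. "'utf-8' codec can't decode byte 0xc3 ...: unexpected end of data"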

My current dirty hack to fix this is:

    except ValueError:
        old_len = len(self.buffer)
        data = self.f.read(BUFSIZE)
        try:
            self.buffer += data.decode('utf-8')
        except UnicodeDecodeError:
            # the read probably stopped inside a multi-byte character:
            # pull one more byte and try to decode again
            data += self.f.read(1)
            self.buffer += data.decode('utf-8')
        if len(self.buffer) == old_len:
            raise common.IncompleteJSONError()

But this is not a good solution because UTF-8 characters can have an arbitrary length AFAIK, so one extra byte isn't always enough. My intuition is that the right approach would be to only decode the content when you really need it, but I didn't manage to code this (tbh, I've only tried for about 10 minutes).
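
For instance (again my own minimal sketch, not part of the hack above), a 3-byte character split after its first byte still fails even after pulling in one extra byte:

    # Minimal sketch: one extra byte is not enough when the split character
    # still needs two more bytes to be complete.
    snowman = u'\u2603'.encode('utf-8')   # SNOWMAN: three bytes, '\xe2\x98\x83'
    try:
        snowman[:2].decode('utf-8')       # first byte of the character + one extra byte
    except UnicodeDecodeError as e:
        print(e)                          # still "unexpected end of data"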

Here is some code to reproduce this bug (but I've only managed to make it work on a Debian stable where, apparently, the encoding is a bit messed up): https://gist.github.com/Psycojoker/7225794 (you obviously need ijson 1.1 to make it work).

Kind regards, and thanks a lot for ijson; it has helped me a lot :)

Psycojoker commented 11 years ago

It appears that it happens much more often in 500 MB of data full of Unicode names. You can find a (dirty) solution that works for UTF-8 characters of arbitrary length here: https://github.com/Psycojoker/ijson/commit/726f56d8199780b9f73264b2cdef5c1df32171e5

I didn't submit a pull request because I consider this solution a dirty hack, and I think this should be fixed in a better way somewhere else if possible.

isagalaev commented 11 years ago

Oh, that's embarrassing :-(. Indeed, it is a problem, and a very real one for, say, JSON with a lot of text in Asian languages that translate to 3- or 4-byte sequences in UTF-8. I can't think of a good solution off the top of my head, but I have an idea that might just work: all the syntactically significant characters are ASCII anyway, so we can parse without decoding at all and decode only fully parsed strings. I shall look into it…
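
A rough, hypothetical sketch of that idea (not what ijson actually ships, and escape sequences are ignored for brevity): since the quote character is a single ASCII byte, a string value can be delimited on the raw byte stream and decoded only once it is complete.

    # Hypothetical sketch only: delimit the string on raw bytes, decode at the end.
    def read_string(buf, start):
        # buf: bytes holding the JSON text, start: index of the opening '"'
        end = buf.index(b'"', start + 1)   # the closing quote is a single ASCII byte
        return buf[start + 1:end].decode('utf-8'), end + 1

    value, pos = read_string(b'{"name": "\xe2\x98\x83"}', 9)
    print(value)                           # prints the snowman character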

klemetsson commented 10 years ago

I too encounter this error very often when dealing with JSON data containing Japanese, Hindi and Russian characters.

Psycojoker commented 10 years ago

@isagalaev as horrible as it is, maybe you could merge my hack in the meantime to spare your users this bug :/ ? I can open a pull request if you'd like.

isagalaev commented 10 years ago

Yeah, that's probably a good temporary solution. Could you make it into a pull request?

Psycojoker commented 10 years ago

@isagalaev done https://github.com/isagalaev/ijson/pull/17

klemetsson commented 10 years ago

This solved my problems as well. Thanks!

klemetsson commented 10 years ago

Sorry, I was a bit premature :/

I still encounter errors on the other self.f.read(BUFSIZE).decode('utf-8') lines when dealing with characters from various non-Latin alphabets.

I added this hack to solve it and it seems to work for me. Tested it with lots of different characters.

    def _read_(self, size):
        if self._buffer is None:
            self._buffer = self.f.read(size)
        ret = self._buffer
        self._buffer = self.f.read(size)
        # 0b10xxxxxx bytes are UTF-8 continuation bytes; move any that start
        # the next chunk onto the current one so no character gets split
        while len(self._buffer) and ord(self._buffer[:1]) & 0xc0 == 0x80:
            ret += self._buffer[:1]
            self._buffer = self._buffer[1:]
        return ret

Then I replaced every self.f.read(BUFSIZE).decode('utf-8') with self._read_(BUFSIZE).decode('utf-8') and added self._buffer = None in __iter__.

It uses two read buffers and moves all unfinished bytes from the start of the second buffer onto the first. It should work with UTF-8 characters of any length.

Best regards

isagalaev commented 10 years ago

I've finally managed to get my hands on this issue, sorry for the delay. It turns out Python's "codecs" module already has a correct streaming UTF-8 decoder, which fixes this in an easy and fast way.
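
Something along these lines (a sketch of the approach, not necessarily the exact change that went into ijson): wrapping the byte stream with codecs.getreader('utf-8') gives a reader that buffers a trailing incomplete multi-byte sequence and completes it on the next read instead of raising.

    import codecs
    import io

    # Sketch of the approach: read() never splits a multi-byte character,
    # because incomplete trailing bytes are kept until more data arrives.
    raw = io.BytesIO(u'caf\xe9 \u2603'.encode('utf-8'))
    reader = codecs.getreader('utf-8')(raw)
    print(repr(reader.read(4)))   # 'caf' -- the lone 0xc3 byte is held back
    print(repr(reader.read(4)))   # '\xe9 ' -- completed with the buffered byte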

isagalaev commented 10 years ago

Also, credit where it's due: thanks to Daniel Lescohier for making me aware of codecs.StreamReader.

I'm going to release a new version of ijson with this fix, so if anyone reading this could test it on their data that would be very handy! Thanks!

casperlehmann commented 10 years ago

The updated version works on my data. Thanks a lot for the fix.