piskvorky / sqlitedict

Persistent dict, backed by sqlite3 and pickle, multithread-safe.
Apache License 2.0
1.17k stars 131 forks source link

Modify for UnicodeDecodeError in load the multi-byte character value… #76

Closed tapdo closed 4 years ago

tapdo commented 6 years ago

… created by Python 2

menshikh-iv commented 6 years ago

Hello @tapdo, what's a motivation for this change? Also, need to add a test for this case.

piskvorky commented 6 years ago

Needs a clear motivating message with a description of the problem and solution.

Please add a unit test that fails before this change and succeeds after this change.

tapdo commented 6 years ago

The problem is "UnicodeDecodeError" occurred on python3 when reading database values include multi-byte character which are stored on python 2 environment. Because of the difference between python 3 & python 2, load function in pickle failed in this situation. The solution is additional argument of load function "encoding='bytes'" to avoid decoding value as string. Sorry I don't know how to create a test for python version issue. Only I confirmed good working in my environment both python 2 and 3.

menshikh-iv commented 6 years ago

@tapdo can you provide at least small example (how to reproduce this error)?

tapdo commented 6 years ago

unicode_fail_sample.zip My small sample. Please run on python 3 to reproduce error. Once you remove "my_db.sqlite", after that you can run a script on python 2 to create "my_db.sqlite" include this issue. I confirmed on python 2.7.13 and 3.6.1 , just for the record.

menshikh-iv commented 6 years ago

I checked it on 3.5 and 3.6 - all works correctly (with provided DB), for 2.7 this doesn't work (because in 2.7 max pickle protocol is 2 (but used 4))

If I re-create DB with 2.7 and after try to use it in 3.* - I see error like

Traceback (most recent call last):
  File "unicode_fail_sample.py", line 11, in <module>
    print("Load:" + mydict['tapdo'].decode('utf-8'))
  File "/home/ivan/.virtualenvs/abc3.5/lib/python3.5/site-packages/sqlitedict.py", line 246, in __getitem__
    return self.decode(item[0])
  File "/home/ivan/.virtualenvs/abc3.5/lib/python3.5/site-packages/sqlitedict.py", line 105, in decode
    return loads(bytes(obj))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
tapdo commented 6 years ago

@menshikh-iv Sorry for my mistake. The last part you show is my pointing it out. I attached the correct sample. unicode_fail_sample2.zip

menshikh-iv commented 6 years ago

Code:


from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite', autocommit=True)

if 'tapdo' not in mydict:
    mydict['tapdo'] = u"○".encode('utf-8')
else:
    print("Load:" + mydict['tapdo'].decode('utf-8'))

mydict.close()

Problems:

@tapdo first mentioned problem is your case, am I right?

tapdo commented 6 years ago

@menshikh-iv Yes, you right.

menshikh-iv commented 6 years ago

@tapdo OK, thanks. @piskvorky what's should be expected behavior for this case (this is a bug or not)?

piskvorky commented 6 years ago

I don't remember if sqlitedict committed to cross-Python compatibility officially, so I don't know if it's a bug.

But to me compatibility in both directions sounds like a good idea (we don't need any new pickle protocols). Any reason why that shouldn't work?

menshikh-iv commented 6 years ago

@piskvorky About pickle: we can pin 2 pickle version, but we should be very careful with it (pin for all new DB, for old - check the version and use it instead of 2).

About current encoding - I'm worried about testing this change (looks like we can easily break something) + probably, current PR will decrease performance significantly (because of loads -> rase/catch -> loads), need to benchmark it too.

piskvorky commented 6 years ago

I think pickle can recognize its version automatically during load, no action needed. Or what do you mean?

I agree that try/catch is not a good solution for checking API versions. We want to make the compatibility uniform (no exceptions) and explicit.

menshikh-iv commented 6 years ago

@piskvorky pickle can recognize automatically, but we shouldn't pin version for old DB where version != 2.

piskvorky commented 6 years ago

I still don't understand -- pin what, where, why?

menshikh-iv commented 6 years ago

@piskvorky pin pickle protocol version to 2 for cross-compatibility between py2 and py3

piskvorky commented 6 years ago

AFAIK pickle recognizes its protocol automatically on load. What actions on existing DBs do you think are needed?

menshikh-iv commented 6 years ago

@piskvorky I'm talking about saving now (not loading). For existing DB - at least extract pickle version and use same (for saving).

mpenkov commented 5 years ago

Looks like I'm late to the party here. How should we proceed with this PR?

If we want to pursue it, then we need to:

This requires additional effort. Personally, my recommendation is to abandon the PR, and explicitly state that we do not support cross-compatibility. My motivation is that Py2 will be obsolete soon, and we can direct our efforts in more productive directions elsewhere.

mpenkov commented 4 years ago

Closing due to inactivity.