Storing unicode keys returns bytestrings that must be decoded as utf8.

jquast commented 9 years ago

I know sqlite is utf-8 internally, that's fine.

But I end up with boiler plate code that says:

for dictionaries of keys of str, decode as utf-8 -- on retrieval, only.

This seems wrong. Shouldn't sqlitedict do this for us?

>>> mydict = sqlitedict.SqliteDict('/tmp/some_.db', autocommit=True)
>>> hamsterface = u'🐹'
>>> print hamsterface, repr(hamsterface), type(hamsterface)
🐹 u'\U0001f439' <type 'unicode'>
>>> mydict[hamsterface] = {hamsterface: hamsterface}
>>> print mydict.keys()[0], repr(mydict.keys()[0]), type(mydict.keys()[0])
🐹 '\xf0\x9f\x90\xb9' <type 'str'>
>>> ## Here, it is raw bytes, not unicode.  It still "prints" because my terminal happens to also be utf-8
>>> mydict.values()[0]
{u'\U0001f439': u'\U0001f439'}
>>> ## Yet, the value remains unicode. The only way to get the same unicode dict that we tried to represent back, is to decode it manually.
>>> decoded = { mydict.keys()[0].decode('utf-8'): mydict.values()[0] }
>>> decoded
{u'\U0001f439': {u'\U0001f439': u'\U0001f439'}}

If you agree, that the .keys() and .items() and so on should automatically decode str-type as utf-8 into unicode-type, then I will be happy to submit a PR.

piskvorky commented 9 years ago

Hmm, interesting.

I don't remember why/where the keys become utf8. But it certainly makes sense if what you get out of the dictionary is what you put in in the first place (unicode => unicode, rather than unicode => str, as you say).

I'd be a little careful with changes that break backward compatibility for str keys, but if you have a way of fixing the unicode keys only, that would be perfect. Does that make sense?

jquast commented 9 years ago

Makes sense. You're right we need to be very careful for compatibility. This is actually quite complicated in such respects. You're looking for:

although the key is a bytestring (any encoding), it was stored by a newer version of sqlitedict as one, so return as bytestring (same behavior as previous version).
although the key is bytestring, it was stored by a newer version of sqlitedict as unicode, so so return as unicode (decoded by utf-8). (new, proposed behavior).
although the key is a bytestring, it was saved by an earlier version that was not aware of its type (unknown encoding), so return as a bytestring (new behavior for backwards compatibility).

Thinking some more, downstream, I'm always explicitly decoding as utf-8, and such code would fail after the change: https://github.com/jquast/x84/blob/140f480b956cdadc742801e3ff57f23a34cf8816/x84/bbs/msgbase.py#L74 https://github.com/jquast/x84/blob/b1c2ef927af95214db3081fe8d3cebb71958e1a7/x84/default/tetris.py#L58 https://github.com/jquast/x84/blob/34cd65c1de666bcf5e9e50501d83d87367915ac6/x84/default/lc.py#L47 https://github.com/jquast/x84/blob/3ee80262a33a9ef2fb497c210c969574b6cb9223/x84/bbs/userbase.py#L16

With a version spec, existing databases would continue to succeed with such downstream workarounds (item 3), but if i were to instantiate new databases, and their version was implicitly set to 2, then it would fail ...

I've been in these situations a few times before, open source api libraries are the hardest... here are some suggestions.

A. provide an entirely new package

I think this is the only way not to cause anybody trouble, and new users of the library get the "unicode in, unicode out" behavior without having to research using any toggles or parameters to make it so.

Create a branch in sqlitedict for the current version "1.x". This will continue to publish as package sqlitedict. Bugfixes may be backported into this branch, and released as "sqlitedict".
master branch becomes a new package name, such as sqlitedict2.
Documents "this package supercedes sqlitedict"
setup.py register the old branch, with a new README heading that says "This package is superseded by , however it remains available for backwards compatibility"

This is nice, because the code change is minimal, it can simply naturally store by type, and return the same type. Easy to test and no magic about backwards compatibility.

B. do not break compatibility

Document a caveat: "Keys of type unicode are stored, but returned as utf-8 bytestrings. You must explicitly decode them as utf-8, or provide do_unicode_keys=True to return as a unicode. utf8-encoded bytestrings are always returned for unicode types by default for backwards compatibility with version X.Y."
Provide the implied do_unicode_keys=False by default argument for SqliteDict.
Encourage that this should be set True for new software.

C. break compatibility

I won't go into the details much more, because this one is probably the least likely. But basically document the change in Changelog, provide example error messages caused by the change which users might run into to help search engines find the cause, and the workaround (set do_unicode_keys=False or some such).

piskvorky commented 9 years ago

I'm -1 on A, +1 on B or C.

And if C: just mention the change in Changelog, bump up the version to 2.0.0 and call it a day (no new parameters). So that's basically A, but discerned by package version rather than new package name.

In the end, the choice between B and C is yours Jeff, since you're the one who found the issue. Whatever suits you best!

Storing the type means sqlitedict will suddenly need more memory for the same amount of keys, right?

piskvorky commented 9 years ago

Closing -- subsumed by #25.

piskvorky / sqlitedict

Storing unicode keys returns bytestrings that must be decoded as utf8. #19