values() bug with unicode characters

andresriancho commented 10 years ago

>>> m = Trie([chr(x) for x in range(0, 255)])
>>> m[u'core/encoding_utf8/\xe9.py'] = 1
>>> m[u'core/encoding_utf8/\u6539.py'] = 2
>>> m.values(u'core/encoding_utf8')
[1]
>>> u'core/encoding_utf8/\u6539.py' in m
True

I would expect this output instead:

...
>>> m.values(u'core/encoding_utf8')
[1, 2]
...

Using datrie==0.6.1

kmike commented 10 years ago

This is likely happens because of lack of validation. You alphabet is 0-255, but the data contain values outside this alphabet.

Unfortunately, libdatrie (which is this wrapper using) doesn't really support unicode; it only supports mapping a subset of Unicode (alphabet you're passing) to a single-byte 0-255 range.

So, to fix that the following is needed:

1) make validation better; 2) add notes to README about how alphabet work; 3) maybe ditch libdatrie unicode support and do it ourselves - encode everything to utf8 and use a dummy 0-255 -> 0-255 mapping. The tricky part would be custom iteration because some chars will become multibyte.

I don't have plans to work on this in near future, so contributions are welcome.

What datrie version are you using? There was validation improvements in datrie 0.7.

kmike commented 10 years ago

I've taken a quick look at your code, and it seems you're building a trie at the startup and then just using it. If so, you may give https://github.com/kmike/DAWG a shot - it supports unicode properly, and it is faster and more memory efficient than datrie. The downside is that you can't modify DAWG after building.

andresriancho commented 10 years ago

This is likely happens because of lack of validation. You alphabet is 0-255, but the data contain values outside this alphabet.

Are you sure? I mean... those special characters are translated to multi-byte somewhere, and those bytes are always 0-255.

I had to change to [chr(x) for x in range(0, 255)] because of this "false positive":

>>> from datrie import Trie
>>> import string
>>> m = Trie(string.printable)
>>> m[u'a'] = 1
>>> m[u'á'] = 1
>>> m[u'é'] = 1
>>> u'ì' in m
True

This looks like a similar/related bug.

So, to fix that the following is needed...

I believe 3) is the best option. Before sending anything to the library you simply encode('utf-8') and that that's it. It would be at wrapper level and potentially fix all these issues.

What datrie version are you using? There was validation improvements in datrie 0.7.

Using datrie==0.6.1

andresriancho commented 10 years ago

you may give https://github.com/kmike/DAWG a shot

Will try it out, thanks for the link.

If nothing works, I might end up working on 3), does that make sense to you?

andresriancho commented 10 years ago

After some minor tests: DAWG won't be useful in my case since my values are instances, not bytes.

>>> class A():
...     pass
... 
>>> a = A()
>>> data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', a)]
>>> bytes_dawg = dawg.BytesDAWG(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dawg.pyx", line 473, in dawg.BytesDAWG.__init__ (src/dawg.cpp:7904)
  File "dawg.pyx", line 288, in dawg.CompletionDAWG.__init__ (src/dawg.cpp:5795)
  File "dawg.pyx", line 37, in dawg.DAWG.__init__ (src/dawg.cpp:1982)
  File "dawg.pyx", line 472, in genexpr (src/dawg.cpp:7725)
TypeError: Expected bytes, got instance
>>>

andresriancho commented 10 years ago

Source required to do 3) is in C. Haven't touched that in ages. Won't be able to help.

kmike commented 10 years ago

Even better than (3) is to talk to libdatrie author and make its "unicode support" optional. Currently, alphabet is a range of unicode chars that datrie support. It doesn't use multibyte characters, it always encode char using a single byte. For other wrappers I usually encode/decode data using utf8, but with libdatrie it means extra work/slowdowns as encoding/decoding will happen twice (in a wrapper and in a library).

You can use IntDAWG with indices and store values in a list (this is what datrie doing for you). Another option is https://github.com/kmike/marisa-trie - it is slower, but it is even more memory efficient, and it gives you indices for free.

andresriancho commented 10 years ago

Running in the latest version of datrie (from git clone), gives a different result:

>>> from datrie import Trie
>>> m = Trie([chr(x) for x in range(0, 255)])
>>> m[u'core/encoding_utf8/\xe9.py'] = 1
>>> m[u'core/encoding_utf8/\u6539.py'] = 2
>>> m.values(u'core/encoding_utf8')
[1]
>>> u'core/encoding_utf8/\u6539.py' in m
False
>>>

Note the last "False", which differs from the initially reported "True". This makes the datrie a little bit more consistent, but still a bug IMHO. At least it should raise an exception when trying to add something that afterwards it will tell you that it is not there.

andresriancho commented 10 years ago

Another option is https://github.com/kmike/marisa-trie

Oh man... you :love_letter: wrappers huh?

kmike commented 10 years ago

Here is one more for your pleasure: https://github.com/kmike/hat-trie :) I was looking for a perfect data structure at one point of my life..

I agree it should raise an exception. That said, it is documented that what you're trying to do won't work (see a warning here https://pypi.python.org/pypi/datrie/).

andresriancho commented 10 years ago

You can use IntDAWG with indices and store values in a list (this is what datrie doing for you)

Thanks for the hint, finally I ended up using the same idea but with IntCompletionDAWG and works like a charm!

https://github.com/andresriancho/django-moth/commit/038419de9f31e7344c4a8e8ad5bcef5d4012b82f

eromoe commented 6 years ago

Current unicode support is still buggy, https://github.com/pytries/datrie/issues/48

pytries / datrie

values() bug with unicode characters #20