Open andresriancho opened 10 years ago
This is likely happens because of lack of validation. You alphabet is 0-255, but the data contain values outside this alphabet.
Unfortunately, libdatrie (which is this wrapper using) doesn't really support unicode; it only supports mapping a subset of Unicode (alphabet you're passing) to a single-byte 0-255 range.
So, to fix that the following is needed:
1) make validation better; 2) add notes to README about how alphabet work; 3) maybe ditch libdatrie unicode support and do it ourselves - encode everything to utf8 and use a dummy 0-255 -> 0-255 mapping. The tricky part would be custom iteration because some chars will become multibyte.
I don't have plans to work on this in near future, so contributions are welcome.
What datrie version are you using? There was validation improvements in datrie 0.7.
I've taken a quick look at your code, and it seems you're building a trie at the startup and then just using it. If so, you may give https://github.com/kmike/DAWG a shot - it supports unicode properly, and it is faster and more memory efficient than datrie. The downside is that you can't modify DAWG after building.
This is likely happens because of lack of validation. You alphabet is 0-255, but the data contain values outside this alphabet.
Are you sure? I mean... those special characters are translated to multi-byte somewhere, and those bytes are always 0-255.
I had to change to [chr(x) for x in range(0, 255)]
because of this "false positive":
>>> from datrie import Trie
>>> import string
>>> m = Trie(string.printable)
>>> m[u'a'] = 1
>>> m[u'á'] = 1
>>> m[u'é'] = 1
>>> u'ì' in m
True
This looks like a similar/related bug.
So, to fix that the following is needed...
I believe 3) is the best option. Before sending anything to the library you simply encode('utf-8')
and that that's it. It would be at wrapper level and potentially fix all these issues.
What datrie version are you using? There was validation improvements in datrie 0.7.
Using datrie==0.6.1
you may give https://github.com/kmike/DAWG a shot
Will try it out, thanks for the link.
If nothing works, I might end up working on 3), does that make sense to you?
After some minor tests: DAWG won't be useful in my case since my values are instances, not bytes.
>>> class A():
... pass
...
>>> a = A()
>>> data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', a)]
>>> bytes_dawg = dawg.BytesDAWG(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "dawg.pyx", line 473, in dawg.BytesDAWG.__init__ (src/dawg.cpp:7904)
File "dawg.pyx", line 288, in dawg.CompletionDAWG.__init__ (src/dawg.cpp:5795)
File "dawg.pyx", line 37, in dawg.DAWG.__init__ (src/dawg.cpp:1982)
File "dawg.pyx", line 472, in genexpr (src/dawg.cpp:7725)
TypeError: Expected bytes, got instance
>>>
Source required to do 3) is in C. Haven't touched that in ages. Won't be able to help.
Even better than (3) is to talk to libdatrie author and make its "unicode support" optional. Currently, alphabet is a range of unicode chars that datrie support. It doesn't use multibyte characters, it always encode char using a single byte. For other wrappers I usually encode/decode data using utf8, but with libdatrie it means extra work/slowdowns as encoding/decoding will happen twice (in a wrapper and in a library).
You can use IntDAWG with indices and store values in a list (this is what datrie doing for you). Another option is https://github.com/kmike/marisa-trie - it is slower, but it is even more memory efficient, and it gives you indices for free.
Running in the latest version of datrie
(from git clone), gives a different result:
>>> from datrie import Trie
>>> m = Trie([chr(x) for x in range(0, 255)])
>>> m[u'core/encoding_utf8/\xe9.py'] = 1
>>> m[u'core/encoding_utf8/\u6539.py'] = 2
>>> m.values(u'core/encoding_utf8')
[1]
>>> u'core/encoding_utf8/\u6539.py' in m
False
>>>
Note the last "False", which differs from the initially reported "True". This makes the datrie a little bit more consistent, but still a bug IMHO. At least it should raise an exception when trying to add something that afterwards it will tell you that it is not there.
Another option is https://github.com/kmike/marisa-trie
Oh man... you :love_letter: wrappers huh?
Here is one more for your pleasure: https://github.com/kmike/hat-trie :) I was looking for a perfect data structure at one point of my life..
I agree it should raise an exception. That said, it is documented that what you're trying to do won't work (see a warning here https://pypi.python.org/pypi/datrie/).
You can use IntDAWG with indices and store values in a list (this is what datrie doing for you)
Thanks for the hint, finally I ended up using the same idea but with IntCompletionDAWG
and works like a charm!
https://github.com/andresriancho/django-moth/commit/038419de9f31e7344c4a8e8ad5bcef5d4012b82f
Current unicode support is still buggy, https://github.com/pytries/datrie/issues/48
I would expect this output instead:
Using datrie==0.6.1