pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License
1.03k stars 91 forks source link

Can't get value using key with null char (marisa_trie.Trie) #47

Open fvarrebola opened 6 years ago

fvarrebola commented 6 years ago

I might be missing something out here, but I believe there is a consistency issue with the Trie implementation and keys that have null characters.

Take a look at the following code snippet:

key = 'Random\x00Key'
python_dict = { key : 'random_value' }
key in python_dict # prints True
python_dict.get(key) # returns 'random_value'

std_trie = marisa_trie.Trie(python_dict)
key in std_trie # prints True
std_trie.keys() # prints ['Random\x00Key']
std_trie.get(key) # should return 'random_value'
std_trie.key_id(key) # should return the key id

What happens is that std_trie.get(key) actually returns None, and std_trie.key_id(key) throws a KeyError exception with the following trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/marisa_trie.pyx", line 409, in marisa_trie.Trie.key_id
  File "src/marisa_trie.pyx", line 417, in marisa_trie.Trie.key_id
KeyError: 'Random\x00Key'

Apparently, the RecordTrie implementation is immune to this consistency issue.

r_trie = marisa_trie.RecordTrie('<H', zip([key], [(1,)]))
key in r_trie # prints True
r_trie.keys() # prints ['Random\x00Key']
r_trie.get(key) # prints [(1,)]

By the way, if you confirm this issue as an actual bug, also check DAWG. I haven't used it extensively, but calling dawg.DAWG(python_dict) should throw the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dawg.pyx", line 45, in dawg.DAWG.__init__ (src\dawg.cpp:2147)
  File "dawg.pyx", line 70, in dawg.DAWG._build_from_iterable (src\dawg.cpp:2570)
dawg.Error: Can't insert key b'Random\x00Key' (with value 0)

I'm running Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)] on win32 and using marisa-trie 0.7.5 and DAWG 0.7.8.