pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License
1.03k stars 91 forks source link

Repeated keys #40

Open hulikau opened 6 years ago

hulikau commented 6 years ago
import marisa_trie
a = [(u'1', '1'), (u'1', '2')]
tr = marisa_trie.BytesTrie(a)
print tr.keys()

This will output [u'1', u'1'].

I guess, that, this function should returns a [u'1']

I'm ready to fix it, if someone consider, that this is bug.

superbobry commented 6 years ago

Woah, looks like a bug to me. Interestingly, it is not present in marisa_trie.Trie:

>>> marisa_trie.Trie(["foo", "foo"]).keys()
['foo']
>>> marisa_trie.BytesTrie([("foo", b"1"), ("foo", b"2")]).keys()
['foo', 'foo']
hulikau commented 6 years ago

@superbobry In the same time in tests this situation is expectable: https://github.com/pytries/marisa-trie/blob/0.7.5/tests/test_bytes_trie.py#L106.

By the way I understood why this happens: For Trie class everything is perfect. For BytesTrie and others inherited from him, in the constructor pairs (key, value) transform to new_key = key + separator + value. Inside the methods keys and iterkeys you iterate over the set of new_keys and then cut old key back.

for i in range(0, ag.key().length()):
  if raw_key[i] == self._c_value_separator:
    key = raw_key[:i].decode('utf8')

And then you add the same key to the res several times.