pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License
1.04k stars 92 forks source link

Trie objects don't support comparison #16

Closed superbobry closed 9 years ago

superbobry commented 9 years ago

A trivial example:

>>> from marisa_trie import Trie
>>> Trie() == Trie()
False
>>> Trie(["foo", "bar"]) == Trie(["foo", "bar"])
False

There's one more interesting property: different tries seem to hash to the same value:

>>> hash(Trie())
271079393
>>> hash(Trie(["foo", "bar"]))
271079393
>>> hash(Trie(["foo", "bar", "boo"]))
271079393

This might be due to a free list-based allocation, but anyway the behaviour is confusing.

kmike commented 9 years ago

Hi @superbobry,

I'm fine with adding __eq__ and __ne__ methods to Trie and other classes; pull requests are welcome :)

As for the hash, it looks like a consequence of a free list-based allocation indeed; two objects that are in memory at the same time are getting different hashes:

In [7]: from marisa_trie import Trie
In [8]: trie1 = Trie()
In [9]: trie2 = Trie()
In [10]: hash(trie1)
Out[10]: 281853959
In [11]: hash(trie2)
Out[11]: 281853967
superbobry commented 9 years ago

I'll try to submit a PR during the weekend.

two objects that are in memory at the same time are getting different hashes

I can add __hash__ as well, if it's ok with you.

kmike commented 9 years ago

Thanks!

I think adding __hash__ is not necessary - what problem adding it solves? Current implementation doesn't e.g. affect dict collisions because hash is the same only for objects which are not alive at the same time.

superbobry commented 9 years ago

Good point. Having a trie as a dictionary key doesn't make much sense anyway :)

superbobry commented 9 years ago

There's one more edge-case: Trie implements __getitem__ and doesn't override __iter__. This (unfortunately) makes it iterable in CPython:

>>> t = Trie()
>>> iter(t)
<iterator object at 0x102939f28>
>>> for key in t:
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Argument 'key' has incorrect type (expected str, got int)

Is it possible to make the Trie yield keys in __iter__ just like dict does?

kmike commented 9 years ago

Ouch. Didn't know about that Python feature. Overriding __iter__ to return keys makes sense.