pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License
1.03k stars 91 forks source link

Allow to use arbitrary sequences as elements, not only strings #13

Open dragoon opened 10 years ago

dragoon commented 10 years ago

I tried to construct the following trie:

trie = marisa_trie.Trie([('New', 'York'), ('New', 'Castle')])

Which gave me AttributeError: 'tuple' object has no attribute 'encode'. So I suppose the library accepts only strings, but sometimes you want other structures.

derpston commented 10 years ago

Have you tried using the RecordTrie instead? (same module)

dragoon commented 10 years ago

I don't really understand this structure, it has some keys and values, while I have only values.

derpston commented 10 years ago

Ah, I see what you mean now, disregard my earlier comment. Yeah, as far as I'm aware it only accepts unicode strings.

kmike commented 10 years ago

@dragoon I'm not sure adding support for having any object as a key is a good idea - because I don't know how to implement it efficiently.

We can't store just an id of object (it defeats the purpose of marisa-trie), so we should somehow serialize the key to bytes to use it as a key. For strings the wrapper encodes unicode input to utf8.

In order to support arbitrary objects we may use pickle, but I'm not sure how compressable is the result, and better task-specific serialization methods usually exists. For example, in your case (a tuple with 2 strings) it makes sense to join the strings using some separator before adding to the trie and split by this separator when retreiving. You don't need marisa-trie support to do this.

But that's true that there are some edge cases (separator inside the tuple element?), splitting/joining tuples could be more efficient if implemented in Cython, and storing tuples of strings is quite common. So I think adding a trie subclass that allows tuples of strings as keys is a good idea - ngram storage is a common use case. Pull requests are welcome :)