pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License
1.03k stars 91 forks source link

Possible to walk the trie? #20

Closed EliFinkelshteyn closed 9 years ago

EliFinkelshteyn commented 9 years ago

I don't see any methods that would allow for just proceeding one edge in the trie, for example:

trie = marisa_trie.Trie([u'key1', u'key2', u'kite'])
trie.edges('k') #would return [u'ke', u'ki']

Using the prefixes method for something like this would be very expensive if the trie is big and the prefix is short. Is there some technical detail I'm missing for why implementing a function for this would be costly, or some other reason this isn't implemented? It seems like it's a necessary step in the traversal with the prefixes() method anyway, and quite useful for predictive lookup operations.

kmike commented 9 years ago

It is a marisa-trie C++ library limitation - see https://code.google.com/p/marisa-trie/issues/detail?id=9.

.prefixes() for Trie is implemented using a dedicated function marisa-trie provides, but .prefixes() of RecordTrie is suboptimal because of this limitation.

What is your use case? It is possible to implement .edges() function (or add a support for Iterator objects) in https://github.com/kmike/DAWG and https://github.com/kmike/DAWG-Python; these packages often provide compression rates similar to marisa-trie.

EliFinkelshteyn commented 9 years ago

Use case is anything like fuzzy predictive search, where at each edge forward, a fuzzy comparison is made using some fuzzy distance metrics coupled with score to see if it's worth moving forward past that edge. With the current implementation, I'd need to instead proactively check to see if edges exist (so instead of trie.edges() giving ['ke','ki'], I'd be forced to check all possible next letters following "k" for existence, which gets more and more expensive with bigger alphabets).

By the way, I hope this doesn't come off as criticism. I think all your trie implementations are awesome, and looked through DAWG and dat-trie for similar functionality last night. I'll try to add this functionality myself if necessary, but wanted to check if I'm missing something first.

On Mar 13, 2015, at 4:12 AM, Mikhail Korobov notifications@github.com wrote:

It is a marisa-trie C++ library limitation - see https://code.google.com/p/marisa-trie/issues/detail?id=9.

.prefixes() for Trie is implemented using a dedicated function marisa-trie provides, but .prefixes() of RecordTrie is suboptimal because of this limitation.

What is your use case? It is possible to implement .edges() function (or add a support for Iterator objects) in https://github.com/kmike/DAWG and https://github.com/kmike/DAWG-Python; these packages often provide compression rates similar to marisa-trie.

— Reply to this email directly or view it on GitHub.

kmike commented 9 years ago

Thanks!

datrie has State and Iterator objects which allow to implement any iteration in user code. It has some issues though - memory savings are much smaller than in DAWG/DAFSA or marisa-trie, and building of datrie can be slow for large tries (see https://github.com/kmike/datrie/issues/12). I don't use datrie myself, and the development stalled because for my use cases one of DAWG, marisa-trie or hat-trie usually work better than datrie. There are also some open issues in tracker.

DAWG doesn't have a public API for custom iteration, but this API can be implemented. Check https://github.com/kmike/DAWG-Python/blob/master/dawg_python/wrapper.py and how are classes from there used in https://github.com/kmike/DAWG-Python/blob/master/dawg_python/dawgs.py (DAWG package has a very similar coe, but in Cython). Pull requests are welcome :)

Unfortunately we can't create a similar API for marisa-trie because of C++ library limitations.

I haven't added a public API for iteration to DAWG because of speed issues. Lookups are going to be much slower if iteration is done in Python instead of Cython, that's why there are many optimized methods for common use cases instead of a couple of generic iterator helpers.

kmike commented 9 years ago

If you have suggestions for new optimized methods for DAWG (e.g. for fuzzy match) then I'm open to PRs; it could be a good addition to the library.

EliFinkelshteyn commented 9 years ago

Alright, I'll investigate a bit more, and if it still looks useful, I'll send over a PR. Closing the issue out here, since it'd be much easier to do for DAWG than the Marisa-Trie.

jstypka commented 8 years ago

Actually, this is the second time that this feature would be very useful to me, so I thought I might mention it here. I also looked through other Python trie libraries (PyTrie, patricia-trie, python-trie) and neither of them support this.

My usecase (which I would expect to be not that rare) is walking down the tree while computing Levenshtein distance, I use it for fuzzy matches similar to @EliFinkelshteyn.