pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License
1.04k stars 92 forks source link

Fuzzy Matching #25

Open bkj opened 8 years ago

bkj commented 8 years ago

I see that this data structure supports prefix lookups -- does it also support fuzzy lookups (i.e. all records within Levenshtein distance). If that's not supported in this package / this data structure, do you know of any other packages that would let me do in-memory fuzzy searching?

~ Ben

superbobry commented 8 years ago

In theory you can do fuzzy matching with tries via branching search, but a more efficient way might be to use a Levenshtein automata. I've never used it in Python, so I can't recommend a specific package, but there're some, e.g. leven.

bkj commented 8 years ago

Ah thanks. I figured that Levenshtein automata would work for this, but I couldn't find a python implementation. The one that you linked computes Levenshtein distance but according to the docs, Levenshtein automata are still to be implemented.

There is a node module (https://www.npmjs.com/package/node-levenshtein-automata) that implements Levenshtein automata, but I haven't had a chance to take a look at it yet.

I may look into implementing a fuzzy (branching) search in marisa-trie in the near future. I imagine there might be some interest in this, since I haven't been able to find much about in-memory fuzzy searching (esp. in Python).

~ Ben

superbobry commented 8 years ago

JFYI there's another blog post with Python code for building and using the automata.

I may look into implementing a fuzzy (branching) search in marisa-trie in the near future. I imagine there might be some interest in this, since I haven't been able to find much about in-memory fuzzy searching (esp. in Python).

Feel free to submit a PR. I'm not sure if anyone uses tries for insertions/deletions (as in Leveshtein distance), though, so if you come up with a simpler mismatch-only solution, it would be useful as well.

kmike commented 8 years ago

Hey,

I think it won't be possible to use marisa-trie efficiently for that, as it lacks an ability to do step-wise walking (see https://code.google.com/p/marisa-trie/issues/detail?id=9 and https://github.com/kmike/marisa-trie/issues/20). https://github.com/kmike/DAWG or https://github.com/kmike/DAWG-Python may be a better fit.

fgregg commented 7 years ago

Here's a fast Levenstein search. It uses ternary search trees. https://github.com/mattandahalfew/Levenshtein_search