Timing of Levenshtein Automaton

When the lexicon is reasonably large I am not shocked that matching could take up to a few seconds. A few things that might help:

Use a smaller lexicon (obviously)
Consider using closest_match rather than closest_matches so as to avoid the pruning step
If certain types of edit operations don't make sense for your application, you can disable them; for instance if you believe the relationship between the query and the lexicon entry never involves a deletion, you can remove the deletion arcs in the edit factors
You can put an upper bound on the distance between the input and the output, which will drop this entire operation into a lower complexity class. I've made a sketchy example of how you might implement this here: https://gist.github.com/kylebgorman/9bce642c1b986e96ff6ce3a518769c74

I hope at some point to write that method up.

On Fri, Aug 17, 2018 at 4:44 PM funderburkjim notifications@github.com wrote:

Hi! I carried through a test of the Levenshtein Automaton matching for a lexicon of about 200k Sanskrit words. The work is in this repository https://github.com/funderburkjim/pynini-learn, specifically in the ejf07 directory.

The automaton usage works fine, and is elegant. However, the efficiency was less than I had hope.

Does the 5+ sec per closest match surprise you? Is there some optimization step that would drastically reduce this?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kylebgorman/EditTransducer/issues/1, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJuOe54SaLL7pLk6RB31htg4_GFo08Hks5uRyshgaJpZM4WCK8x .

kylebgorman / EditTransducer

Timing of Levenshtein Automaton #1