sindresorhus / leven

Measure the difference between two strings with the fastest JS implementation of the Levenshtein distance algorithm
MIT License
715 stars 27 forks source link

parameter for maximum distance #14

Open bittlingmayer opened 5 years ago

bittlingmayer commented 5 years ago

This is the most performant Levenshtein implementation for NodeJS of which I know, and I have a though on how to make it faster for many applications.

When testing many string pairs by similarity (for example, when sorting, or filtering strings above or below a threshold), often we are happy to short-circuit on any pair with a distance greater than max.

> leven('abcdef', '123456')`
6
> leven('abcdef', 'abcdefg')`
1
> leven('abcdef', '123456', 3)`
3
> leven('abcdef', 'abcdefg', 3)`
1

At scale, given that the least similar strings are also very expensive to compute, there is a big potential savings. (For one, just comparing length is enough to discard many candidates.)

(For pairs with a distance above max, the distance returned can be null, max or max + 1 - I don't have a strong opinion on which is best, but probably max is the most useful, as it would require not extra handling for sorting or filtering.)

sindresorhus commented 5 years ago

Makes sense. PR welcome. Should be an options-object with an option called maximumDistance.

sindresorhus commented 5 years ago

but probably max is the most useful, as it would require not extra handling for sorting or filtering.)

👍

Yomguithereal commented 5 years ago

Feel free to take inspiration from this code: https://github.com/Yomguithereal/talisman/blob/master/src/metrics/distance/levenshtein.js#L230-L340

ka-weihe commented 5 years ago

This is not the fastest Levenshtein implementation for NodeJS. In fact it is not even the second fastest. The fastest is: https://github.com/ka-weihe/node-levenshtein

sindresorhus commented 4 years ago

If anyone wants to work on this, see https://github.com/sindresorhus/leven/pull/15 for the previous attempt and the feedback there.