tdebatty / java-string-similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

WeightedLevenshtein ins/del weights. #46

Closed by ewanmellor 6 years ago

ewanmellor commented 6 years ago

Extend WeightedLevenshtein to support customizable insertion and deletion weights. Previously, both weights were hardcoded at 1.0. Customizing them lets the caller under-weight the insertion of a thin letter such as I or l, for example to reflect the likelihood of OCR errors.

This adds a new interface, CharacterInsDelInterface, as an adjunct to CharacterSubstitutionInterface. The old behavior is preserved if the caller does not provide a CharacterInsDelInterface implementation.

This also adds insert / deletion tests to the old WeightedLevenshteinTest.testDistance, and adds a new testDistanceCharacterInsDelInterface test.
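
For readers who want to see the idea in isolation, here is a minimal, self-contained sketch of a weighted edit distance with per-character insertion and deletion costs. It is not the code from this PR: the `SubstitutionCost` and `InsDelCost` interfaces, their method names, and the `WeightedEditDistanceSketch` class are hypothetical stand-ins chosen for illustration, and passing `null` for the insert/delete costs is assumed to mean "fall back to 1.0", mirroring the old behavior described above.

```java
// Hypothetical stand-ins for the interfaces named in this PR. The method
// names are assumptions for illustration, not the library's confirmed API.
interface SubstitutionCost {
    double cost(char from, char to);
}

interface InsDelCost {
    double insertionCost(char c);
    double deletionCost(char c);
}

final class WeightedEditDistanceSketch {

    // Default weights: every insertion and deletion costs 1.0 (the old behavior).
    static final InsDelCost UNIT = new InsDelCost() {
        public double insertionCost(char c) { return 1.0; }
        public double deletionCost(char c) { return 1.0; }
    };

    // Cost of turning s1 into s2 with a two-row dynamic program.
    static double distance(String s1, String s2, SubstitutionCost sub, InsDelCost insDel) {
        final InsDelCost w = (insDel == null) ? UNIT : insDel;
        double[] prev = new double[s2.length() + 1];
        double[] curr = new double[s2.length() + 1];

        // Row 0: build each prefix of s2 from the empty string by insertions only.
        for (int j = 1; j <= s2.length(); j++) {
            prev[j] = prev[j - 1] + w.insertionCost(s2.charAt(j - 1));
        }

        for (int i = 1; i <= s1.length(); i++) {
            char c1 = s1.charAt(i - 1);
            // Column 0: reduce each prefix of s1 to the empty string by deletions only.
            curr[0] = prev[0] + w.deletionCost(c1);
            for (int j = 1; j <= s2.length(); j++) {
                char c2 = s2.charAt(j - 1);
                double substitute = prev[j - 1] + (c1 == c2 ? 0.0 : sub.cost(c1, c2));
                double delete = prev[j] + w.deletionCost(c1);
                double insert = curr[j - 1] + w.insertionCost(c2);
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            // Swap rows so that prev always holds the most recently completed row.
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()];
    }
}
```

With unit substitution costs and an `InsDelCost` that charges 0.5 for deleting `i`, this sketch scores "String" against "Strng" at 0.5 instead of 1.0. Passing `null` for the insert/delete costs gives the plain 1.0 result.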

coveralls commented 6 years ago

Coverage increased (+0.1%) to 94.949% when pulling cfcde791e2bbbe50fcdec2e3c3a983722113d6fc on NationalBI:weighted-levenshtein-ins-del into a5d842111753f77bb679c82c37628338f868aec8 on tdebatty:master.

tdebatty commented 6 years ago

Nice job! Thanks!

tdebatty commented 6 years ago

I just created release 1.1.0 with your contribution. Should be available within 24h on Maven...

ewanmellor commented 6 years ago

It's working great. Thanks for the fast release, @tdebatty.