tdebatty / java-string-similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

A Ratcliff-Obershelp implementation would be helpful #44

Closed pellcorp closed 4 years ago

pellcorp commented 6 years ago

There is a library with an accurate implementation, but it's based on Scala.

mpkorstanje commented 6 years ago

This appears to be a good reference:

https://ilyankou.files.wordpress.com/2015/06/ib-extended-essay.pdf

pellcorp commented 6 years ago

Yep, Ratcliff-Obershelp scores better than Jaro-Winkler in many cases.

lvjiujin commented 5 years ago

"There is a library which has an accurate implementation but it's based on Scala."

Where is the link?

pellcorp commented 5 years ago

In a proof of concept, I used the Maven coordinates com.rockymadden.stringmetric:stringmetric-core:0.26.1

This corresponds to the GitHub project https://github.com/rockymadden/stringmetric

I tested the Ratcliff/Obershelp implementation in the stringmetric-core project against known good test data.
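
For readers unfamiliar with the algorithm, here is a minimal Java sketch of the Ratcliff/Obershelp score as it is commonly described: find the longest common substring, recurse on the pieces to its left and right, and return twice the number of matched characters divided by the combined length of both strings. This is not the code from stringmetric or from this library. The class name `RatcliffObershelpSketch` and its methods are hypothetical, and treating two empty strings as a perfect match is an assumption.

```java
/** Illustrative sketch only; names are hypothetical, not a library API. */
final class RatcliffObershelpSketch {

    /** Returns 2 * matchedChars / (|a| + |b|), a value in [0, 1]. */
    static double similarity(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 2.0 * matchedChars(a, b) / (a.length() + b.length());
    }

    /** Counts matched characters by recursing around the longest common substring. */
    private static int matchedChars(String a, String b) {
        int bestLen = 0;
        int bestA = 0;
        int bestB = 0;

        // Dynamic programming over common suffix lengths, keeping one previous row.
        int[] prev = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                    if (curr[j] > bestLen) { // strict '>' keeps the first longest match
                        bestLen = curr[j];
                        bestA = i - bestLen;
                        bestB = j - bestLen;
                    }
                }
            }
            prev = curr;
        }

        if (bestLen == 0) {
            return 0; // nothing in common, stop recursing
        }

        // Matched characters = this anchor + matches to its left + matches to its right.
        return bestLen
                + matchedChars(a.substring(0, bestA), b.substring(0, bestB))
                + matchedChars(a.substring(bestA + bestLen), b.substring(bestB + bestLen));
    }
}
```

With this sketch, `RatcliffObershelpSketch.similarity("WIKIMEDIA", "WIKIMANIA")` matches "WIKIM" and then "IA", which gives 2 * 7 / 18, or about 0.78. Implementations can differ in how they break ties between equally long substrings, so scores from other libraries may not agree exactly.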

denmase commented 4 years ago

Sorry for bumping this thread. I ported a .NET implementation of Ratcliff-Obershelp (by Ligi, a patch to fuzzystring) in my fork. I'm a novice with both Java and GitHub, so I haven't made a pull request yet. I'd be glad if somebody could help test it. Thank you.

paulirwin commented 4 years ago

@denmase If you can submit a PR, I'd be happy to help review it. BTW, I help run the port of this to .NET at https://github.com/feature23/StringSimilarity.NET - but we are a 100% port only, so we do not add new features that aren't added here first.

denmase commented 4 years ago

@paulirwin I'll submit a PR, but pardon me if the code is not yet up to an acceptable standard. I know StringSimilarity.NET as well. I actually use both, so thank you for SS.Net. If you want, I can try to add Ratcliff-Obershelp to it too. Thank you for the help.

tdebatty commented 4 years ago

Fixed in 4946f586712e4c91d12d766c62ae495db6506733 and PR #55