Closed tomdaffurn closed 10 months ago
This is excellent @tomdaffurn! Thank you for the contribution. I had been thinking about how to implement a couple of these improvements, but your solution is excellent. From the results I've seen this could be merged and replace the existing algorithm. We've made similar releases in the past.
Thanks for the review and tick Adam! You've got a great tool here, and it's fun to work on.
There were some linting errors in my code, so I've fixed those and added to README.md.
Merging #524 (2532acd) into master (1ef25be) will increase coverage by 1.63%. Report is 7 commits behind head on master. The diff coverage is 0.00%.
:exclamation: Current head 2532acd differs from pull request most recent head ddb46e7. Consider uploading reports for the commit ddb46e7 to get more accurate results.
This is a re-write of the `jaroWinkler` function with the goal of improving the scoring performance. The new algorithm changes several things:

The resulting search behaviour has a significantly better true positive rate AND false positive rate. Examples of this are shown in `cmd/server/new_algorithm_test.go`.

I've done testing with 2000 real customer names, and with 50 sanctioned names. The aggregated results are below. I can share the 50 sanctioned names data, but the 2000 customer names are too sensitive to share.
I haven't fixed all of the tests or written enough new tests yet, but I'm happy to do so if you like this change.
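
For readers who want a baseline to compare against, here is a minimal sketch of the textbook Jaro-Winkler similarity in Go. It is not the code in this PR and not the project's existing `jaroWinkler` function. The names `jaroSketch` and `jaroWinklerSketch` are hypothetical, and the prefix scale of 0.1 and prefix cap of 4 are the conventional defaults, assumed here rather than taken from the repository.

```go
package main

// jaroSketch returns the textbook Jaro similarity of a and b in [0, 1].
// Hypothetical helper for illustration; not the project's implementation.
func jaroSketch(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 && len(rb) == 0 {
		return 1
	}
	if len(ra) == 0 || len(rb) == 0 {
		return 0
	}

	// Two equal characters count as a match only if their positions
	// differ by at most this window.
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}

	matchedA := make([]bool, len(ra))
	matchedB := make([]bool, len(rb))
	matches := 0
	for i := range ra {
		lo := i - window
		if lo < 0 {
			lo = 0
		}
		hi := i + window + 1
		if hi > len(rb) {
			hi = len(rb)
		}
		for j := lo; j < hi; j++ {
			if !matchedB[j] && ra[i] == rb[j] {
				matchedA[i], matchedB[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}

	// Walk the matched characters of both strings in order and count
	// the positions where they disagree; half of that is the number
	// of transpositions.
	outOfOrder, j := 0, 0
	for i := range ra {
		if !matchedA[i] {
			continue
		}
		for !matchedB[j] {
			j++
		}
		if ra[i] != rb[j] {
			outOfOrder++
		}
		j++
	}

	m := float64(matches)
	t := float64(outOfOrder) / 2
	return (m/float64(len(ra)) + m/float64(len(rb)) + (m-t)/m) / 3
}

// jaroWinklerSketch boosts the Jaro score for strings sharing a common
// prefix of up to maxPrefix characters (assumed defaults: 0.1 and 4).
func jaroWinklerSketch(a, b string) float64 {
	const prefixScale = 0.1
	const maxPrefix = 4

	score := jaroSketch(a, b)
	ra, rb := []rune(a), []rune(b)
	prefix := 0
	for prefix < len(ra) && prefix < len(rb) && prefix < maxPrefix && ra[prefix] == rb[prefix] {
		prefix++
	}
	return score + float64(prefix)*prefixScale*(1-score)
}
```

For the classic pair `"MARTHA"` and `"MARHTA"`, this sketch gives a Jaro score of about 0.944 and a Jaro-Winkler score of about 0.961. Some variants apply the prefix boost only when the Jaro score exceeds a threshold such as 0.7; that step is left out here to keep the sketch short.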