moov-io / watchman

AML/CTF/KYC/OFAC Search of global watchlist and sanctions
https://moov-io.github.io/watchman/
Apache License 2.0
334 stars 88 forks source link

Experimental Improved Search Algorithm #524

Closed tomdaffurn closed 10 months ago

tomdaffurn commented 10 months ago

This is a re-write of the jaroWinkler function with the goal of improving the scoring performance. The new algorithm changes several things:

  1. Compare tokens in the search to the index tokens
    • i.e. "find matches for every search token" rather than "find match for every indexed token"
    • Improves scores of searches that don't include "middle" names
    • Prevents sanctioned names that are 1 word (HADI, EMMA, KAMILA) matching long searches
    • Has a side-effect that short search terms will have more false positives. I think this is a good trade off as the sanction lists will always contain the full name, but the search might not
  2. Once a token has matched something, it can't match a different token
    • This prevents names with repeated words having artificially high scores
    • e.g. prevents any search containing "Vladimir" matching "VLADIMIROV, Vladimir Vladimirovich"
  3. Weights each word-score by the length of the word, relative to the search and indexed name
    • This corrects for error that is introduced by splitting names into tokens and doing piecewise Jaro-Winkler scoring
    • Combing word-scores using a simple average gives short words (like Li, Al) equal weight to much longer words
    • The length-weighted scores are comparable to what you get by doing whole-name to whole-name Jaro-Winkler comparison
  4. Punishes word-scores when the matching tokens have significantly different length
  5. Punishes word-scores when the matching tokens start with different letters

The resulting search behaviour has significantly better true positive rate AND false positive rate. Examples of this are shown in cmd/server/new_algorithm_test.go.

I've done testing with 2000 real customer names, and with 50 sanctioned names. The aggregated results are below. I can share the 50 sanctioned names data, but the 2000 customer names are too sensitive to share.

I haven't fixed all of the tests and written enough new tests, but I'm happy to do so if you like this change.

image image
adamdecaf commented 10 months ago

This is excellent @tomdaffurn! Thank you for the contribution. I had been thinking about how to implement a couple of these improvements, but your solution is excellent. From the results I've seen this could be merged and replace the existing algorithm. We've made similar releases in the past.

tomdaffurn commented 10 months ago

Thanks for the review and tick Adam! You've got a great tool here, and it's fun to work on.

There were some linting errors in my code, so I've fixed those and added to README.md

codecov-commenter commented 10 months ago

Codecov Report

Merging #524 (2532acd) into master (1ef25be) will increase coverage by 1.63%. Report is 7 commits behind head on master. The diff coverage is 0.00%.

:exclamation: Current head 2532acd differs from pull request most recent head ddb46e7. Consider uploading reports for the commit ddb46e7 to get more accurate results

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #524 +/- ## ========================================= + Coverage 8.18% 9.81% +1.63% ========================================= Files 44 38 -6 Lines 3531 2811 -720 ========================================= - Hits 289 276 -13 + Misses 3219 2511 -708 - Partials 23 24 +1 ```