semantics-for-personal-health / heals-notebook

Use a more appropriate entity resolution algorithm #7

Open stouffers opened 2 years ago

stouffers commented 2 years ago

Currently, the library TheFuzz (formerly fuzzywuzzy) is used for entity resolution. It looks for entities with a low Levenshtein distance to the entire query string. This works acceptably for short queries, but as the query gets longer, the Levenshtein distance between each keyword and the whole query grows, and the keywords stop matching.
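To make the failure mode concrete, here is a minimal sketch of whole-string Levenshtein distance in plain Python. It is not the notebook's actual code and does not call TheFuzz; the entity name and the queries are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical entity and queries, chosen only to show the trend.
entity = "diabetes"
queries = ["diabetes", "diabetes diet", "what diet is good for diabetes"]
distances = [levenshtein(entity, q) for q in queries]
print(distances)  # [0, 5, 22]
```

The keyword appears verbatim in every query, yet its distance to the whole string grows with every extra character of the query, so any fixed threshold eventually rejects it.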

A more appropriate algorithm would search for matching (and ideally close-to-matching) substrings rather than matching against the whole query. Ideally we would use a stable existing implementation rather than writing our own.
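As a rough illustration of what substring-based matching could look like, the sketch below uses the standard approximate-substring variant of the edit-distance table (often attributed to Sellers), where a match may start and end anywhere in the query. This is only one possible approach, written in plain Python for clarity; the function name `best_substring_distance` and the example strings are hypothetical, and a maintained library would still be preferable in practice.

```python
def best_substring_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`."""
    # Row 0 is all zeros: a match may begin at any position in the text for free.
    prev = [0] * (len(text) + 1)
    for i, cp in enumerate(pattern, start=1):
        curr = [i]  # first i pattern characters matched against an empty substring
        for j, ct in enumerate(text, start=1):
            cost = 0 if cp == ct else 1
            curr.append(min(prev[j] + 1,          # skip a pattern character
                            curr[j - 1] + 1,      # absorb an extra text character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    # Taking the minimum of the last row lets the match end anywhere in the text.
    return min(prev)


# Hypothetical query strings; the second contains a misspelling of the entity.
exact = best_substring_distance("diabetes", "what diet is good for diabetes")
close = best_substring_distance("diabetes", "what diet is good for diabetis")
print(exact, close)  # 0 1
```

Because the score depends only on the best-matching region, it stays small no matter how long the surrounding query is, which is the property the whole-string comparison lacks.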

Ideas: