moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

[FEAT] Word-level similarity metric #1956

Open V-Lamp opened 7 months ago

V-Lamp commented 7 months ago

Is your proposal related to a problem?

I need to find similarities in company names. Often, words are added ("limited", etc.) or removed (e.g. a first name in personal companies). This can significantly change similarity scores: for example, I need to lower the Jaro-Winkler threshold to ~0.76 to catch such true matches, but at that level it also matches clearly unrelated company names.

Describe the solution you'd like

Use TF-IDF with cosine similarity and word-level tokenisation (splitting on spaces), e.g. `tf_idf_cosine_similarity(l.name, r.name)`.
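A minimal pure-Python sketch of what such a function could compute (the function names, the IDF smoothing, and the reference corpus are all illustrative assumptions, not an existing Splink feature):

```python
import math
from collections import Counter

def idf_from_corpus(names):
    """IDF weights from a reference corpus of names: rare words
    (e.g. "ford") count for more than common ones ("limited")."""
    df = Counter(w for n in names for w in set(n.lower().split()))
    n = len(names)
    return {w: math.log(n / d) + 1.0 for w, d in df.items()}

def tf_idf_cosine_similarity(name_l, name_r, idf):
    """Cosine similarity between word-level TF-IDF vectors of two names."""
    def vectorise(name):
        counts = Counter(name.lower().split())
        return {w: tf * idf.get(w, 1.0) for w, tf in counts.items()}
    vl, vr = vectorise(name_l), vectorise(name_r)
    dot = sum(vl[w] * vr.get(w, 0.0) for w in vl)
    norm = (math.sqrt(sum(x * x for x in vl.values()))
            * math.sqrt(sum(x * x for x in vr.values())))
    return dot / norm if norm else 0.0
```

Because "limited" appears in almost every corpus name, adding or dropping it barely moves the score, whereas changing a rare word moves it a lot.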

Describe alternatives you've considered

Calculate Jaccard similarity on the set of words rather than the set of characters. This would let me identify company names that share "enough" of their words, without the negative effects of lowering the Jaro threshold too far. A "set of words" column could be pre-computed to boost performance.
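This alternative is small enough to sketch in full (a hypothetical helper, not an existing Splink function):

```python
def word_jaccard(name_l, name_r):
    """Jaccard similarity on the set of *words*, not characters:
    |intersection| / |union| of the two word sets."""
    sl = set(name_l.lower().split())
    sr = set(name_r.lower().split())
    if not sl and not sr:
        return 1.0  # two empty names: treat as identical
    return len(sl & sr) / len(sl | sr)
```

Adding one extra word such as "limited" to a three-word name only drops the score from 1.0 to 0.75, while names sharing no words score 0.0.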

Additional context

Essentially, I am looking for a metric that is less sensitive to the addition or removal of whole words than to minor changes within words (since even a one-letter change makes two words count as different).

RobinL commented 6 months ago

Thanks - your proposal makes complete sense in terms of why you'd want this.

The term frequency part is very hard to achieve as a built-in Splink feature without introducing more SQL joins and therefore making the codebase significantly more complicated. The only approach I can see would be somehow pre-preparing the data so that a company name was represented as a list of dicts: `[{"name": "ford", "tf": 0.0001}, {"name": "motor", "tf": 0.003}, {"name": "company", "tf": 0.01}]`

You could then define the similarity using something like an array intersect weighted by term frequency, built from the various nested functions (e.g. here), or a cosine measure like you suggest.
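To make the idea concrete, here is one way a tf-weighted intersection over that list-of-dicts representation could look in Python. The `-log(tf)` weighting and the normalisation are illustrative choices, not an existing Splink function:

```python
import math

def tf_weighted_intersection(tokens_l, tokens_r):
    """Score in [0, 1]: weight of shared words over weight of all words,
    where tokens are dicts like {"name": "ford", "tf": 0.0001}."""
    wl = {t["name"]: t["tf"] for t in tokens_l}
    wr = {t["name"]: t["tf"] for t in tokens_r}

    def weight(tf):
        # Rare words (low term frequency) carry more matching evidence.
        return -math.log(tf)

    shared = sum(weight(min(wl[w], wr[w])) for w in wl.keys() & wr.keys())
    total = sum(weight(min(wl.get(w, 1.0), wr.get(w, 1.0)))
                for w in wl.keys() | wr.keys())
    return shared / total if total else 0.0
```

With this weighting, a very common word like "limited" being present or absent barely changes the score, while sharing a rare word like "ford" contributes heavily.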

At the moment, the only out-of-the-box option you really have is an array intersect level (see here), but I recognise that doesn't fully meet your needs.

You can also use a custom comparison if you want to have a go at the tf approach outlined above.
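As a rough illustration of what a custom comparison could look like, here is a sketch in Splink 3's dict-based comparison format, computing a word-level Jaccard directly in SQL. The DuckDB list functions (`string_split`, `list_intersect`) and the 0.7 threshold are assumptions to be checked against your backend; this is not a tested configuration:

```python
def words(col):
    # Hypothetical DuckDB expression: split a name into a list of words.
    return f"string_split(lower({col}), ' ')"

# |intersection| / |union|, expressed via |union| = |l| + |r| - |intersection|
intersect = f"len(list_intersect({words('name_l')}, {words('name_r')}))"
jaccard = (
    f"{intersect} * 1.0 / "
    f"(len({words('name_l')}) + len({words('name_r')}) - {intersect})"
)

name_comparison = {
    "output_column_name": "name",
    "comparison_levels": [
        {
            "sql_condition": "name_l IS NULL OR name_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": f"{jaccard} >= 0.7",
            "label_for_charts": "Word-level Jaccard >= 0.7",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}
```

The tf-weighted variant would follow the same shape, with the `sql_condition` operating on the pre-prepared list-of-dicts column instead of splitting the raw name.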