intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

New Element Type for product names #65

Closed ffuf-schilling closed 1 year ago

ffuf-schilling commented 2 years ago

I've been trying to compare model names of different machines. I played around with the configuration of my elements but I'm not getting good results.

I would like to get a high match score when the longer string contains something similar to the shorter one, for example:

STP375S-B60/Wnh_1500V_20V02_1756
STP 375S-B60/Wnh

If there is a configuration that would produce my desired score, that would be best. If not, I suggest adding a new Element type, "Product Designation", that could compare strings like "M80 PDF C" sensibly.

manishobhatia commented 2 years ago

Yes, this scenario is supported through configuration. It follows a pattern similar to the one described under "NGram token match".

The basic idea is to break each element into substrings (n-grams) and match those. For the example you gave, the two strings would be broken down as follows (assuming tri-grams):

STP375S-B60/Wnh_1500V_20V02_1756 -> ['STP', 'TP3', 'P37', '375', '75S', ...., 'Wnh', 'nh_', 'h_1' ..... '756']
STP 375S-B60/Wnh                 -> ['STP', 'TP3', 'P37', '375', '75S', ...., 'Wnh']

When the score is calculated, the matching grams raise the score for strings that share a similar substring. You will most likely never get a 100% match, but you can set a lower threshold that works for your use case.
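To make the idea concrete, here is a minimal sketch in plain Java. It is not the library's implementation and does not use its API: the class `TriGramSketch`, its methods, and the choice to strip whitespace before tokenizing are all assumptions made for illustration. It computes two scores over the tri-gram sets: Jaccard similarity (shared grams over all grams) and a containment score (shared grams over the grams of the shorter string).

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; not part of fuzzy-matcher.
class TriGramSketch {

    // Assumption: whitespace is removed before tokenizing, so that
    // "STP 375S" and "STP375S" produce the same grams.
    static String normalize(String s) {
        return s.replaceAll("\\s+", "");
    }

    // Returns the set of distinct 3-character substrings of the normalized input.
    // Inputs shorter than 3 characters yield a single gram (the whole string).
    static Set<String> triGrams(String s) {
        String n = normalize(s);
        Set<String> grams = new LinkedHashSet<>();
        if (n.isEmpty()) {
            return grams;
        }
        if (n.length() < 3) {
            grams.add(n);
            return grams;
        }
        for (int i = 0; i + 3 <= n.length(); i++) {
            grams.add(n.substring(i, i + 3));
        }
        return grams;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|.
    static double jaccard(String a, String b) {
        Set<String> ga = triGrams(a);
        Set<String> gb = triGrams(b);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(ga);
        shared.retainAll(gb);
        return (double) shared.size() / union.size();
    }

    // Containment: fraction of the smaller gram set that also occurs in the larger one.
    static double containment(String a, String b) {
        Set<String> ga = triGrams(a);
        Set<String> gb = triGrams(b);
        Set<String> small = ga.size() <= gb.size() ? ga : gb;
        Set<String> large = (small == ga) ? gb : ga;
        if (small.isEmpty()) {
            return 0.0;
        }
        long shared = small.stream().filter(large::contains).count();
        return (double) shared / small.size();
    }

    public static void main(String[] args) {
        String longName = "STP375S-B60/Wnh_1500V_20V02_1756";
        String shortName = "STP 375S-B60/Wnh";
        System.out.println("jaccard     = " + jaccard(longName, shortName));
        System.out.println("containment = " + containment(longName, shortName));
    }
}
```

Under these assumptions the two example strings share 13 of 30 distinct tri-grams, so the Jaccard score is about 0.43 while the containment score is 1.0. That gap is why a symmetric score rarely reaches 100% when one name is much longer than the other, and why a lower threshold is needed.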

Here are the configurable items I would recommend experimenting with

Once you have a good general solution for matching product names, using either this configuration or any additional code, I will be happy to include it as a new element type. Feel free to open a PR to get this going.

Hope this helps.

Thanks

manishobhatia commented 1 year ago

Closing this issue. Feel free to open a new one if additional support is needed.