intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

Domains: how to implement a new/different domain #33

Closed gabe2001 closed 3 years ago

gabe2001 commented 4 years ago

Hi, excellent library and I'd love to apply the functionality to other domains. The current code uses address details. As far as I can tell the Element classes would have to be rewritten to accommodate another domain. Am I missing something? cheers, -gabe

manishobhatia commented 4 years ago

Hi Gabe,

The intention of the library is to be domain agnostic, with the fuzzy match driven entirely by the "ElementType".

Currently there are quite a few domain-specific types defined, like "Name", "Address", "Phone Number", "Email", etc., and some generic types like "Text", "Number", and "Date". The idea is to expand these types as we get more requirements from the open source community, and to make the library useful in multiple domains.

That said, even without an enhancement, the library supports overriding the default behaviors, which should make it usable in other domains. For example, the Element API allows you to override most of its matching capability: https://github.com/intuit/fuzzy-matcher#element-configuration

Of these configurations, the "PreProcessingFunction" and "TokenizerFunction" let you inject user-defined code at run time (by means of Java Functions), which provides additional flexibility to match most types of data.
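To illustrate the idea of injecting pre-processing and tokenizing code as Java Functions, here is a minimal stand-alone sketch for a non-person domain (a list of job skills). It uses only `java.util.function.Function`; the class, field, and method names are hypothetical and are not the fuzzy-matcher API, so see the element configuration link above for the real builder methods.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-alone sketch; none of these names come from fuzzy-matcher.
class PluggableMatchingSketch {

    // Pre-processing step: trim, lower-case, and drop characters that are not
    // letters, digits, '+', '#', commas, or spaces.
    static final Function<String, String> SKILL_PRE_PROCESSOR =
            raw -> raw.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9+#, ]", "");

    // Tokenizer step: split a comma-separated skill list into individual tokens.
    static final Function<String, List<String>> SKILL_TOKENIZER =
            text -> Arrays.stream(text.split(","))
                    .map(String::trim)
                    .filter(token -> !token.isEmpty())
                    .collect(Collectors.toList());

    // Apply the two user-supplied functions in order: pre-process, then tokenize.
    static List<String> tokens(String raw,
                               Function<String, String> preProcessor,
                               Function<String, List<String>> tokenizer) {
        return preProcessor.andThen(tokenizer).apply(raw);
    }

    public static void main(String[] args) {
        // prints [java, sql, c++]
        System.out.println(tokens(" Java, SQL , C++ ", SKILL_PRE_PROCESSOR, SKILL_TOKENIZER));
    }
}
```

Because both steps are plain functions, swapping in a different domain only means passing different lambdas; nothing else in the pipeline has to change.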

If there are specific use cases you run into, feel free to send some details and example data sets, and we can look at including it in our next release.

Hope this helps.

gabe2001 commented 4 years ago

Hello Manish,

Thanks for your explanation! I'm perfectly happy with the current functionality. The ask was around making a domain (parameters, value types, etc.) exchangeable. If I want to create a model which is not person related, I'd have to "live" with the existing types, ignore them, and add new ones which are of no interest to people-related properties. Perhaps an enhancement idea/request: have the domain implemented as a pluggable feature (interfaces?).

cheers, -gabe

manishobhatia commented 4 years ago

Hi Gabe,

I like the idea of having pluggable interfaces for various domains that enable easy matching. We will take that into consideration in the next iteration of our release.

In the meantime, I want to assure you that having multiple ElementTypes present has little to no impact (in terms of either memory or CPU usage), even if they are not used.

The ElementTypes are simple, easy-to-use enums, each made up of a different combination of pre-processing function, tokenizer function, and match type. This just makes it easy for the end user to implement matching without delving too deeply into the library.
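As a rough illustration of that design, the sketch below shows an enum whose constants each bundle a pre-processing function, a tokenizer function, and a match type. It is a hypothetical stand-alone example, not the library's source: `SketchType`, `MatchKind`, and the two constants are invented names, and the functions are deliberately simplistic.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch of "an enum made of function combinations"; not library code.
class ElementTypeSketch {

    // Invented match kinds, for illustration only.
    enum MatchKind { EXACT, NEAREST }

    enum SketchType {
        // Free text: lower-case and trim, then split on whitespace.
        TEXT(s -> s.toLowerCase(Locale.ROOT).trim(),
             s -> Arrays.asList(s.split("\\s+")),
             MatchKind.EXACT),
        // Numbers: keep only digits and the decimal point, emit a single token.
        NUMBER(s -> s.replaceAll("[^0-9.]", ""),
               s -> Collections.singletonList(s),
               MatchKind.NEAREST);

        private final Function<String, String> preProcess;
        private final Function<String, List<String>> tokenizer;
        private final MatchKind matchKind;

        SketchType(Function<String, String> preProcess,
                   Function<String, List<String>> tokenizer,
                   MatchKind matchKind) {
            this.preProcess = preProcess;
            this.tokenizer = tokenizer;
            this.matchKind = matchKind;
        }

        // Run the bundled functions in order: pre-process, then tokenize.
        List<String> tokenize(String raw) {
            return preProcess.andThen(tokenizer).apply(raw);
        }

        MatchKind matchKind() {
            return matchKind;
        }
    }

    public static void main(String[] args) {
        System.out.println(SketchType.TEXT.tokenize("  Hello World "));
        System.out.println(SketchType.NUMBER.tokenize("$1,250.50"));
    }
}
```

In a design like this, an unused constant costs only a few small objects created once at class load, which is consistent with the point above about negligible memory and CPU overhead.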

There was an issue posted earlier which in fact suggested removing ElementTypes altogether, out of a similar concern about not making the library domain specific. Personally, I like your suggestion better. We will take both of these points of view into account as the library evolves.

We are interested in knowing which domains this library has been applied to, to help inform our direction. We have seen quite a bit of usage in the Person and Transaction matching domains. Feel free to comment on which domains you see it being useful in.

Thanks again for the suggestions and for helping this project move in the right direction.

gabe2001 commented 4 years ago

I believe the sky is the limit here.

Dating sites are an excellent example of a variable number of properties/attributes to match on. Another is job searching: finding a good candidate by matching skills against a particular job description.

gabe2001 commented 2 years ago

@manishobhatia, after some time I've finally been able to give your library a try! In hindsight, my question was completely irrelevant. I allowed myself to be misled by the addresses example. As you said, the library is completely domain agnostic, although the naming of the ElementType enumerations suggests otherwise. With that out of the way, what I believe could be beneficial is the addition of an "in" match: does a value exist in a set of values, initially with an exact match? The workaround currently is to duplicate the documents for all permutations of the list or lists of values. cheers, -gabe
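A minimal sketch of the two approaches mentioned in this comment, assuming "permutations" means every combination of one value per multi-valued attribute. Both methods are hypothetical illustrations in plain Java and do not use or represent fuzzy-matcher classes: `in` shows the proposed exact-match membership test, and `expand` shows the duplication workaround.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not part of fuzzy-matcher.
class InMatchSketch {

    // Proposed "in" semantics: exact match of one value against a set of candidates.
    static boolean in(String value, Set<String> candidates) {
        return candidates.contains(value);
    }

    // Workaround: flatten a record with multi-valued attributes into one
    // single-valued copy per combination of values (a Cartesian product).
    static List<Map<String, String>> expand(Map<String, List<String>> multiValued) {
        List<Map<String, String>> result = new ArrayList<>();
        result.add(new LinkedHashMap<>()); // start with one empty partial record
        for (Map.Entry<String, List<String>> attribute : multiValued.entrySet()) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : result) {
                for (String value : attribute.getValue()) {
                    Map<String, String> copy = new LinkedHashMap<>(partial);
                    copy.put(attribute.getKey(), value);
                    next.add(copy);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> candidate = new LinkedHashMap<>();
        candidate.put("skill", List.of("java", "sql"));
        candidate.put("city", List.of("berlin", "paris", "rome"));
        // 2 skills x 3 cities = 6 flattened copies
        System.out.println(expand(candidate).size());
        System.out.println(in("sql", Set.of("java", "sql")));
    }
}
```

The number of copies is the product of the list sizes, so the workaround grows quickly as attributes gain values, whereas a membership test stays a single lookup per attribute.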