intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

Is there any way to create my own matchers? #62

Closed: nicolasalba closed this issue 2 years ago

nicolasalba commented 2 years ago

Hi @manishobhatia,

I couldn't find a way to create my own matcher to solve the following problem:

I have document numbers and I would like to compare them with others by the digits they share. For instance, for document A = 12305211C LPZ and document B = 12105321C CBA, the matching score would be 1.0.

Explanation:

Input A: after preprocessing (deleting non-digit chars): 12305211. After tokenization: int[] occurrencesA, where for instance occurrencesA[1] is 3.

Input B: after preprocessing (deleting non-digit chars): 12105321. After tokenization: int[] occurrencesB, where for instance occurrencesB[1] is 3.

Matching function of A with B: (number of digit occurrences shared with B) / (length of A).
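To make the proposed scoring concrete, here is a minimal standalone sketch in plain Java. It does not use the fuzzy-matcher library, and the names `DigitOverlap`, `digitOccurrences` and `score` are hypothetical. It also assumes that the "number of occurrences" means, for each digit, the smaller of its two counts in A and B; the post does not spell this out.

```java
final class DigitOverlap {

    // Count how often each digit 0-9 occurs; every non-digit character is ignored.
    static int[] digitOccurrences(String raw) {
        int[] occurrences = new int[10];
        for (char c : raw.toCharArray()) {
            if (c >= '0' && c <= '9') {
                occurrences[c - '0']++;
            }
        }
        return occurrences;
    }

    // Assumed reading of the proposed formula:
    // shared occurrences of a digit = min(count in A, count in B),
    // summed over all digits and divided by the number of digits in A.
    static double score(String a, String b) {
        int[] occA = digitOccurrences(a);
        int[] occB = digitOccurrences(b);
        int shared = 0;
        int lengthA = 0;
        for (int d = 0; d < 10; d++) {
            shared += Math.min(occA[d], occB[d]);
            lengthA += occA[d];
        }
        // An input without digits has nothing to match, so score it as 0.
        return lengthA == 0 ? 0.0 : (double) shared / lengthA;
    }

    public static void main(String[] args) {
        System.out.println(score("12305211C LPZ", "12105321C CBA")); // prints 1.0
    }
}
```

Under that assumption both example documents contain the same digits the same number of times, so all 8 digits of A are shared and the score is 8 / 8 = 1.0.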

I would appreciate it if you have any workaround or approach to solve the problem with the library.

Thank you very much, I wish you the best

manishobhatia commented 2 years ago

Hi Nicolas,

There are two advanced features that can potentially help with your use case:

  1. Custom pre-processing function: when you create an Element, you can call .setPreProcessingFunction(), which takes a Java function where you define your custom logic (example code).
  2. Custom tokenizer function: along the same lines, you can define how your data is tokenized by calling .setTokenizerFunction() on the Element (example code).
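As a rough illustration of the logic such functions could carry for the document-number case, here is a sketch using plain `java.util.function.Function` values. It is deliberately not wired into the library: the tokenizer passed to `.setTokenizerFunction()` works with the library's own Element and token types, which are not reproduced here. The class name, constant names and the `digit#occurrence` token format are all hypothetical. The token format is one possible way to keep repeated digits distinct when tokens are later compared by equality.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class DocumentNumberFunctions {

    // Pre-processing step: keep only the digits of the raw document number.
    static final Function<String, String> DIGITS_ONLY =
            raw -> raw.replaceAll("[^0-9]", "");

    // Tokenization step: one token per digit occurrence ("1#1", "1#2", "1#3", ...),
    // so that a digit appearing three times yields three different tokens.
    static final Function<String, List<String>> OCCURRENCE_TOKENS = digits -> {
        int[] seen = new int[10];
        List<String> tokens = new ArrayList<>();
        for (char c : digits.toCharArray()) {
            if (c < '0' || c > '9') {
                continue; // defensive: skip anything the pre-processing step missed
            }
            int d = c - '0';
            seen[d]++;
            tokens.add(d + "#" + seen[d]);
        }
        return tokens;
    };

    public static void main(String[] args) {
        List<String> tokens = DIGITS_ONLY.andThen(OCCURRENCE_TOKENS).apply("12305211C LPZ");
        System.out.println(tokens); // prints [1#1, 2#1, 3#1, 0#1, 5#1, 2#2, 1#2, 1#3]
    }
}
```

With tokens of this shape, two document numbers share the token `1#3` only if both contain the digit 1 at least three times, which mirrors the per-digit occurrence counts from the question.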

With this, your data is both preprocessed and tokenized into the values you want to match. The default matching algorithm calculates the score as (tokens matched / number of tokens). (NB: the matching algorithm for numerical and date types calculates the score based on how close the values are numerically. So if you don't need that and are matching exact digits in a number, just use the TEXT type to define the Element.)
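The (tokens matched / number of tokens) ratio can be pictured with a small standalone sketch. This is not the library's implementation. The class and method names are hypothetical, tokens are treated as a set, and the first input's token count is assumed to be the denominator, which the reply does not specify.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TokenOverlapScore {

    // tokens matched / number of tokens, with duplicates collapsed into a set.
    // Assumption: the denominator is the number of distinct tokens in tokensA.
    static double score(List<String> tokensA, List<String> tokensB) {
        Set<String> a = new HashSet<>(tokensA);
        if (a.isEmpty()) {
            return 0.0;
        }
        Set<String> matched = new HashSet<>(a);
        matched.retainAll(new HashSet<>(tokensB)); // keep only tokens present in both
        return (double) matched.size() / a.size();
    }

    public static void main(String[] args) {
        // Four of the five tokens in the first list also appear in the second.
        System.out.println(score(List.of("1", "2", "3", "0", "5"),
                                 List.of("1", "2", "4", "0", "5"))); // prints 0.8
    }
}
```

Because this comparison is based on token equality rather than numeric distance, it corresponds to the TEXT-style matching the note recommends for exact digits.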

Hope this helps

Thanks