Closed by nicolasalba 2 years ago
Hi Nicolas,
There are 2 advanced features that can potentially help with your use case:
1. On the Element, you can add .setPreProcessingFunction(), which takes a Java function where you define your custom preprocessing logic.
2. Also on the Element, you can add .setTokenizerFunction() to define how the value is split into tokens.
With this, your data is both preprocessed and tokenized into the values that you want to match. The default matching algorithm will calculate the score as (tokens matched / number of tokens).
(NB: The matching algorithm for numerical and date types calculates the score from how close the values are numerically. So if you don't need that and are matching exact digits in a number, just use the TEXT type to define the Element.)
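To make the two hooks above more concrete, here is a minimal plain-Java sketch that does not call the library at all. The class DigitTokens, its field names, and the "digit#index" token format are hypothetical, and the functions are ordinary java.util.function.Function values that would still have to be adapted to whatever signatures .setPreProcessingFunction() and .setTokenizerFunction() actually expect. The idea is to tag each digit with its occurrence index, so that repeated digits become distinct tokens and a (tokens matched / number of tokens) ratio also counts repeats.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class DigitTokens {
    // Preprocessing step (assumed shape): keep ASCII digits only.
    public static final Function<String, String> PRE_PROCESS =
            s -> s.replaceAll("[^0-9]", "");

    // Tokenizing step (assumed shape): "1211" -> {"1#1", "2#1", "1#2", "1#3"}.
    // The suffix is the running count of that digit, so repeats stay distinct.
    public static final Function<String, Set<String>> TOKENIZE = digits -> {
        Map<Character, Integer> seen = new HashMap<>();
        Set<String> tokens = new LinkedHashSet<>();
        for (char c : digits.toCharArray()) {
            int n = seen.merge(c, 1, Integer::sum);
            tokens.add(c + "#" + n);
        }
        return tokens;
    };

    // Stand-in for the default scoring described above:
    // tokens of A that also appear in B, divided by the number of tokens of A.
    public static double score(String a, String b) {
        Set<String> tokensA = PRE_PROCESS.andThen(TOKENIZE).apply(a);
        Set<String> tokensB = PRE_PROCESS.andThen(TOKENIZE).apply(b);
        if (tokensA.isEmpty()) {
            return 0.0;
        }
        long matched = tokensA.stream().filter(tokensB::contains).count();
        return (double) matched / tokensA.size();
    }

    public static void main(String[] args) {
        System.out.println(score("12305211C LPZ", "12105321C CBA")); // prints 1.0
    }
}
```

Whether the library scores relative to the tokens of the first element only, as this sketch does, is an assumption; the scoring line is the part to check against the real implementation.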
Hope this helps.
Thanks
Hi @manishobhatia,
I couldn't find a way to create my own matcher to solve the following problem:
I have document numbers and I would like to compare them with others by the digits they share. For instance, document A = 12305211C LPZ, document B = 12105321C CBA. The matching score for these two is 1.0.
Explanation:
Input A: After preprocessing (deleting non-digit chars): 12305211. After tokenization: int[] occurrencesA; for instance, occurrencesA[1] is 3.
Input B: After preprocessing (deleting non-digit chars): 12105321. After tokenization: int[] occurrencesB; for instance, occurrencesB[1] is 3.
Matching function of A with B: (number of digit occurrences that A shares with B) / (number of digits in A).
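The computation described above can be sketched in plain Java, independent of the library. The class and method names are hypothetical, and "number of occurrences" is interpreted here as the sum, over each digit 0-9, of the smaller of the two counts; that reading is an assumption, though it does give 1.0 for the two example documents.

```java
public class DigitOccurrenceMatcher {
    // Counts how often each digit 0-9 appears; all other characters are ignored,
    // which folds the "delete non-digit chars" preprocessing into this step.
    public static int[] countDigits(String raw) {
        int[] occurrences = new int[10];
        for (char c : raw.toCharArray()) {
            if (c >= '0' && c <= '9') {
                occurrences[c - '0']++;
            }
        }
        return occurrences;
    }

    // Score of A against B: shared digit occurrences divided by the digit count of A.
    public static double score(String a, String b) {
        int[] occurrencesA = countDigits(a);
        int[] occurrencesB = countDigits(b);
        int lengthA = 0;
        int shared = 0;
        for (int d = 0; d < 10; d++) {
            lengthA += occurrencesA[d];
            shared += Math.min(occurrencesA[d], occurrencesB[d]);
        }
        return lengthA == 0 ? 0.0 : (double) shared / lengthA;
    }

    public static void main(String[] args) {
        System.out.println(countDigits("12305211C LPZ")[1]);           // prints 3
        System.out.println(score("12305211C LPZ", "12105321C CBA"));   // prints 1.0
    }
}
```

Note that this score is not symmetric: it is normalized by the number of digits in A, so score(A, B) and score(B, A) can differ when the two documents have different lengths.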
I would appreciate it if you have any workaround or approach to solve the problem with the library.
Thank you very much, I wish you the best