intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

More documentation on how to use #10

Closed arun-meena closed 5 years ago

arun-meena commented 5 years ago

Hi @manishobhatia, I like the library and the documentation, but I was hoping you could provide some more information on how to use it in a project (such as Spring Boot). As I am new to Java, I am having a bit of trouble with the implementation.

manishobhatia commented 5 years ago

Hi, this library can be used with core Java features. To instantiate the service, you just need to create a new object of the MatchService class, for example: MatchService matchService = new MatchService();

There are 3 primary ways you can run matches, depending on the use case you are trying to match. They are detailed in this section. https://github.com/intuit/fuzzy-matcher#applying-the-match

To use it with Spring, or any dependency injection library, you can instantiate MatchService as a singleton and use it multiple times. So if you have a Spring component class, just instantiate the service class once, and you can use the getter to access the matchService.

Here is an example of a simple factory component you can define in Spring:

@Component
public class MatchServiceFactory {

    private MatchService matchService = new MatchService();

    public MatchService getMatchService() {
        return this.matchService;
    }
}

Now you can auto-wire MatchServiceFactory in other classes and have access to matchService.

Let me know if this helps. You can also ask questions on https://stackoverflow.com/ and tag your questions with fuzzy-matcher to get more detailed working examples for your use case

arun-meena commented 5 years ago

@manishobhatia Thanks for the reply. I was able to use the MatchService, but the output seems to be incorrect. I wrote the following code; please check and verify:

User userSearch = new User();
userSearch.setName("Name Som");
userSearch.setId(4);
userSearch.setAddress("ADI");
userSearch.setPhone(321);
userSearch.setEmail("name.sam@gmail.com");

User userAvailable = new User();
userAvailable.setName("Name Sam");
userAvailable.setId(5);
userAvailable.setAddress("Jaipur");
userAvailable.setPhone(321);
userAvailable.setEmail("name.sam@gmail.com");

Document newDocument = new Document.Builder(userSearch.getId().toString())
        .addElement(new Element.Builder().setType(TEXT).setValue(userSearch.getName()).createElement())
        .createDocument();

Document searchDocument = new Document.Builder(userAvailable.getId().toString())
        .addElement(new Element.Builder().setType(TEXT).setValue(userAvailable.getName()).createElement())
        .createDocument();

System.out.println("List of Documents: " + searchDocument);
System.out.println("List of Documents: " + newDocument);

List<Document> documentList = new ArrayList<>();
documentList.add(searchDocument);
MatchService matchService = new MatchService();
Map<Document, List<Match<Document>>> map;
map = matchService.applyMatch(newDocument, documentList);
System.out.println("Match : " + map);

And I am getting the following output:

List of Documents: {[{'Name Sam'}]}
List of Documents: {[{'Name Som'}]}
Match : {}

As per the Soundex algorithm, it should give me a match.

manishobhatia commented 5 years ago

You are correct, the words "Sam" and "Som" should give a match when using soundex. The issue you are running into is the MatchOptimizerFunction that is used to improve performance.
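To make the Soundex point concrete, here is a minimal sketch of the classic American Soundex encoding in plain Java. It is not the library's own code; the class name SoundexSketch and the method encode are made up for this illustration, and the library's implementation may differ in details. It only shows why "Sam" and "Som" are expected to match: both reduce to the same code.

```java
import java.util.Locale;

final class SoundexSketch {
    // Code for each letter A..Z: a digit for coded consonants,
    // '0' for vowels and Y, '-' for H and W (which are skipped entirely).
    private static final String CODES = "0123012-02245501262301-202";

    // Illustrative helper (not part of fuzzy-matcher): returns the
    // four-character Soundex code of a single word.
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (code == '-') {
                continue; // H and W do not separate equal codes
            }
            if (code != '0' && code != last) {
                out.append(code); // new consonant sound
            }
            last = code; // a vowel resets the "previous code"
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Sam")); // prints S500
        System.out.println(encode("Som")); // prints S500
    }
}
```

Since vowels are dropped after the first letter, "Sam" and "Som" both encode to S500, which is why a match would be expected if the comparison were actually run.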

To avoid matching every element against every other element, the library skips the match for some elements. Because of this, text with 3 or fewer letters does not go through Soundex matching. There is some explanation of how the optimizer works here: https://github.com/intuit/fuzzy-matcher#performance

That said, if your scenario requires more accurate matching, and the match set is not so big that it needs the optimizer, you can override the newDocument creation with this snippet:

Document newDocument = new Document.Builder(userSearch.getId().toString())
            .addElement(new Element.Builder().setType(TEXT).setValue(userSearch.getName()).setMatchOptimizerFunction(none()).createElement())
            .createDocument();

static MatchOptimizerFunction none() {
    return (tokenList) -> {
        return tokenList.parallelStream()
                .filter(left -> BooleanUtils.isNotFalse(left.getElement().getDocument().isSource()))
                .flatMap(
                        left -> tokenList.parallelStream()
                                .filter(right -> right != null && !left.getElement().getDocument().getKey().equals(right.getElement().getDocument().getKey()))
                                .map(right -> {
                                    double result = left.getElement().getSimilarityMatchFunction().apply(left, right);
                                    return new Match<Token>(left, right, result);
                                })
                                .filter(match -> match.getResult() >= match.getData().getElement().getThreshold())
                );
    };
}

This essentially overrides the default optimizer with a "none" optimizer, which runs a match on each element, irrespective of its size.
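Stripped of the library types, the "none" approach is just an exhaustive pairwise comparison. The sketch below shows that idea in plain Java under stated assumptions: Tok, matchAll and toySimilarity are hypothetical stand-ins (not fuzzy-matcher classes), and the toy similarity simply compares words with their vowels removed rather than running Soundex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class AllPairsSketch {
    // Hypothetical stand-in for a token: the key of its document plus its text.
    static final class Tok {
        final String docKey;
        final String value;

        Tok(String docKey, String value) {
            this.docKey = docKey;
            this.value = value;
        }
    }

    // Compares every token with every token from a different document and
    // returns "left ~ right" for each pair whose score reaches the threshold.
    static List<String> matchAll(List<Tok> tokens,
                                 ToDoubleBiFunction<String, String> similarity,
                                 double threshold) {
        List<String> matches = new ArrayList<>();
        for (Tok left : tokens) {
            for (Tok right : tokens) {
                if (left.docKey.equals(right.docKey)) {
                    continue; // never match a document against itself
                }
                double score = similarity.applyAsDouble(left.value, right.value);
                if (score >= threshold) {
                    matches.add(left.value + " ~ " + right.value);
                }
            }
        }
        return matches;
    }

    // Toy similarity (an assumption, not Soundex): 1.0 when the words are
    // equal after lower-casing and dropping vowels, otherwise 0.0.
    static double toySimilarity(String a, String b) {
        String x = a.toLowerCase().replaceAll("[aeiou]", "");
        String y = b.toLowerCase().replaceAll("[aeiou]", "");
        return x.equals(y) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("4", "Som"));
        tokens.add(new Tok("5", "Sam"));
        System.out.println(matchAll(tokens, AllPairsSketch::toySimilarity, 0.9));
        // prints [Som ~ Sam, Sam ~ Som]
    }
}
```

The nested loop makes the cost visible: the number of comparisons grows with the square of the token count, which is the work the default optimizer tries to avoid.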

I'll include this "none" optimizer in a future release of the library. But hopefully this snippet of code will unblock you for now.

That said, for large sets of documents with multiple elements, I think it's rare that you will run into such issues, and you might benefit from the optimizer. Even in your example, if you include the Address, Phone and Email elements, you will see these 2 documents match, since they have good matches there. The overall score will fall a little because "Sam" does not match "Som", but the difference will be negligible.