intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0
Any way to add name dictionary for normalization characters #44

Closed markairwallex closed 3 years ago

markairwallex commented 3 years ago

I noticed that the library ships with a name-dictionary.txt. Is there any way I can override it, or add my own mappings to this dictionary?

manishobhatia commented 3 years ago

Hi,

Yes, the name and address dictionaries can be provided externally. Both are used to pre-process the data before running a match.

To do that, you have to create a custom pre-processing function and pass it in when creating the Element. Here are the steps, with examples you can use:

  1. In the case of a name, it is usually convenient to remove titles, salutations, prefixes and suffixes. This example builds a Java HashMap with words mapped to an empty string. (This can be read from a file if you would like.)

    Map<String, String> newNameDict = new HashMap<String, String>() {{
            put("Queen", "");
            put("Third", "");
            put("III", "");
        }};

  2. Create a custom function that applies this mapping to any input

    Function<String, String> newNamePreProcessing = (str) -> {
            return Arrays.stream(str.split("\\s+"))
                    .map(d -> newNameDict.containsKey(d) ? newNameDict.get(d) : d)
                    .collect(Collectors.joining(" "));
        };

  3. Override the pre-processing function when creating an element

        String[][] input = {
                {"1", "Victoria Third"},
                {"2", "Queen Victoria III"},
        };
        List<Document> documentList = Arrays.asList(input).stream().map(contact -> {
            return new Document.Builder(contact[0])
                    .addElement(new Element.Builder<String>().setValue(contact[1]).setType(NAME)
                            // Set the custom function
                            .setPreProcessingFunction(newNamePreProcessing)
                            .createElement())
                    .createDocument();
        }).collect(Collectors.toList());

Now, if this is fed to the MatchService, name-dictionary.txt is bypassed for these elements and your custom function is used to pre-process the data instead:

    Map<Document, List<Match<Document>>> result = matchService.applyMatch(documentList);
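
Since step 1 mentions that the mapping can be read from a file, here is a minimal sketch of one way to do that using only the JDK. The class name `NameDictionaryPreProcessor`, its method names, and the one-entry-per-line `word:replacement` file format are assumptions made for illustration; they are not part of fuzzy-matcher and do not necessarily match the format of the bundled dictionary. Unlike the function in step 2, this version also drops tokens that map to an empty string, so removed words do not leave stray spaces behind.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of fuzzy-matcher.
public class NameDictionaryPreProcessor {

    // Parses lines of the assumed form "word:replacement".
    // Lines without ':' or with an empty key are skipped.
    public static Map<String, String> parseDictionary(List<String> lines) {
        Map<String, String> dict = new HashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(':');
            if (sep < 0) {
                continue;
            }
            String key = line.substring(0, sep).trim();
            if (key.isEmpty()) {
                continue;
            }
            dict.put(key, line.substring(sep + 1).trim());
        }
        return dict;
    }

    // Reads the dictionary from a UTF-8 text file.
    public static Map<String, String> loadDictionary(Path file) throws IOException {
        return parseDictionary(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Builds a function that replaces each whitespace-separated token found in
    // the dictionary and drops tokens whose replacement is empty.
    public static Function<String, String> toPreProcessingFunction(Map<String, String> dict) {
        return str -> Arrays.stream(str.trim().split("\\s+"))
                .map(token -> dict.getOrDefault(token, token))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) throws IOException {
        // Write a small dictionary to a temporary file so the example is self-contained.
        Path file = Files.createTempFile("name-dictionary", ".txt");
        Files.write(file, Arrays.asList("Queen:", "Third:", "III:"), StandardCharsets.UTF_8);

        Function<String, String> newNamePreProcessing =
                toPreProcessingFunction(loadDictionary(file));

        System.out.println(newNamePreProcessing.apply("Victoria Third"));     // prints "Victoria"
        System.out.println(newNamePreProcessing.apply("Queen Victoria III")); // prints "Victoria"

        Files.delete(file);
    }
}
```

The resulting function should be usable in place of `newNamePreProcessing` in the `setPreProcessingFunction` call from step 3.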

markairwallex commented 3 years ago

Thanks for your reply. Yes, overriding the pre-processing function can achieve this. But is there a simpler, more direct way to override just the dictionary file, e.g. by providing a file path on the element? Or is there a plan to add this? I think it would be a useful feature for custom pre-processing. Thanks a lot in advance!

manishobhatia commented 3 years ago

Unfortunately there is no easier way at the moment, but we will take this as an enhancement for our next release. Hopefully the above method will unblock you for your immediate needs.

markairwallex commented 3 years ago

OK, cool. Really looking forward to the next release. Thanks!