intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

Aggregate strings based on similarity into groups. #42

Closed hoanghun closed 3 years ago

hoanghun commented 3 years ago

Is there a way to aggregate strings into groups based on similarity? I have a list of strings, List<String> strings, and I need to group them.

MatchService provides applyMatchByGroups, which is very close, but the inner Set contains every match for the group instead of only the unique values, which is what I need.

Thanks in advance for the answer.

manishobhatia commented 3 years ago

Hi @hoanghun, applyMatchByGroups is meant to return only the documents that match.

See this test for an example: https://github.com/intuit/fuzzy-matcher/blob/d8a92026527d61efba30d463d791918785456a8d/src/test/java/com/intuit/fuzzymatcher/component/MatchServiceTest.java#L74

Here, the 1st and 3rd documents match, so the result returns just 1 set containing a Match with these two. The non-matching document is eliminated from the result.

The Match result shows both matched values, but it does not aggregate the strings. If you can share the example you are working through, it might help in understanding how this can be supported.

hoanghun commented 3 years ago

My use case is that I have a list of exception stack traces. I want to group them by similarity and get the count for each group. I managed to do that by putting the IDs of each document and its matching document into a set, whose size then gave me the count I needed.
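For anyone landing here with the same need, the ID-set approach can be sketched roughly as follows. This is a standalone illustration, not code from the library: MatchGrouping and its group method are hypothetical names, and each String[] pair stands in for the ID of a document and the ID of the document it matched, however you extract those from the match results.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: merges matched ID pairs into disjoint groups of unique IDs.
final class MatchGrouping {

    // Follows parent links until it reaches the representative ID of a group.
    private static String find(Map<String, String> parent, String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        parent.put(id, root); // point directly at the representative for later lookups
        return root;
    }

    // Each pair is {documentId, matchedDocumentId}; duplicates and reversed pairs are harmless.
    static Collection<Set<String>> group(List<String[]> matchedPairs) {
        Map<String, String> parent = new HashMap<>();
        for (String[] pair : matchedPairs) {
            String a = find(parent, pair[0]);
            String b = find(parent, pair[1]);
            if (!a.equals(b)) {
                parent.put(a, b); // merge the two groups
            }
        }
        // Collect every ID under its group's representative; sets keep IDs unique.
        Map<String, Set<String>> groups = new HashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            groups.computeIfAbsent(find(parent, id), k -> new TreeSet<>()).add(id);
        }
        return groups.values();
    }
}
```

The size of each returned set is the number of distinct stack traces in that group. Documents that matched nothing never appear in a pair, so they would have to be counted separately as groups of one.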

manishobhatia commented 3 years ago

Thanks @hoanghun. That's an interesting use case. You are correct, it just needs an ID to make each stack trace unique, and applyMatchByGroups should then group them together.

The only thing you need to be aware of is that the matching algorithm looks at every single word while trying to match. Common words such as "at" in stack traces, common package names, or common timestamps can skew the results. You might want to remove them beforehand, or you can use the pre-processing function to remove them.
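As a rough sketch of that kind of cleanup, the function below strips timestamps and the standalone word "at" from a stack trace before matching. StackTraceCleaner, CLEAN, and the timestamp format are assumptions made for illustration; the function is a plain java.util.function.Function, and wiring it in as the pre-processing function is not shown here.

```java
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical cleaner for stack-trace text; adjust the patterns to your log format.
final class StackTraceCleaner {

    // Assumed timestamp shape: 2021-03-04 12:34:56 or 2021-03-04T12:34:56.789
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}[ T]\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?");

    // The frame marker "at" as a whole word, so words containing "at" are untouched.
    private static final Pattern AT_WORD = Pattern.compile("\\bat\\b");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes the noise tokens, then collapses the leftover whitespace.
    static final Function<String, String> CLEAN = trace -> {
        String s = TIMESTAMP.matcher(trace).replaceAll(" ");
        s = AT_WORD.matcher(s).replaceAll(" ");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    };
}
```

Common package prefixes could be removed the same way with one more pattern, chosen for the code base the traces come from.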

Hope this helps.

manishobhatia commented 3 years ago

Closing the issue. Feel free to open a new one if you think the issue was not resolved or would like to see an enhancement to the library.