larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0
Make set comparators that actually work on sets #166

Open larsga opened 10 years ago

larsga commented 10 years ago

That is, instead of working on the set of tokens in a string, the set comparators should work with multi-value properties and compare the sets of values for the property. This needs some fundamental changes in how properties are compared.

YannBrrd commented 10 years ago

Hi,

What do you have in mind? Keep the best score against the whole list?

Cheers, Yann

larsga commented 10 years ago

No, the idea is to compare the sets of values using Jaccard/Dice. Remember that Duke records can have multiple values for a single property. Thus, we can treat these values as a set and compare the sets.
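
To make the idea concrete, here is a minimal sketch of Jaccard and Dice computed over two sets of property values. It is plain Java and does not use any Duke classes; the class name `ValueSetSimilarity` and the choice to score two empty sets as 1.0 are assumptions made for illustration, not how Duke does or would do it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Duke: similarity between two sets of values.
final class ValueSetSimilarity {

    // Jaccard index: |A ∩ B| / |A ∪ B|
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty sets are treated as identical
        }
        int common = intersectionSize(a, b);
        int union = a.size() + b.size() - common;
        return (double) common / union;
    }

    // Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // same assumption as above
        }
        return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
    }

    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a); // copy so the inputs are not modified
        common.retainAll(b);
        return common.size();
    }
}
```

For example, a property with the values {oslo, bergen, trondheim} in one record and {oslo, bergen, stavanger} in the other shares two of four distinct values, so Jaccard is 2/4 and Dice is 4/6. Note that this uses exact string equality between values; matching values approximately would need something more.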

ztsmith commented 10 years ago

Maybe this is overly simplistic, but couldn't we just change the split function to use a configurable split-on value (rather than default to splitting on space)? So rather than splitting multi-values during the cleaning, it is done during comparison.
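
A rough sketch of what that suggestion could look like, assuming the multiple values arrive joined in a single string. `DelimitedSetComparator` and `setSeparator` are made-up names for illustration, not existing Duke API, and Jaccard is used here only as an example measure.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical comparator, not part of Duke: splits each value on a
// configurable separator at comparison time and compares the resulting sets.
final class DelimitedSetComparator {
    private String separator = " "; // default: split on space

    void setSeparator(String separator) {
        this.separator = separator;
    }

    double compare(String v1, String v2) {
        Set<String> a = split(v1);
        Set<String> b = split(v2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty values count as equal
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union; // Jaccard index
    }

    private Set<String> split(String value) {
        Set<String> parts = new HashSet<>();
        // Pattern.quote makes separators like "|" or "." match literally.
        for (String part : value.split(Pattern.quote(separator))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }
}
```

With the separator set to `;`, comparing `"Oslo;Bergen;Trondheim"` with `"Bergen; Oslo; Stavanger"` gives 2 shared values out of 4 distinct ones. The catch with this approach is that the separator must never occur inside a value, which is one reason comparing real multi-value properties may still be the cleaner route.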