larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0
614 stars 194 forks

Cleaner interface should receive the whole record #235

Open marco-brandizi opened 7 years ago

marco-brandizi commented 7 years ago

The Cleaner interface currently receives only the value of the field it is associated with, and is supposed to clean that value. I have a use case where I normalise company names by handling affixes like "Inc" or "Ltd". The problem is that those affixes vary with the country (e.g., in Italy I should treat 'Ltd' as a possible original part of a name, while I should normalise the local equivalent, 'Srl').

I have a country -> affixes map to deal with that, but the company's country is in another field, so I need to pass it to the cleaner. I've managed to do so by concatenating the two fields in the data query and then splitting them again in the cleaner. However, it would be much more practical to have something like Cleaner.clean(value, record) in addition to the current clean(value). In Java 8, that could be done without breaking legacy code, using default methods.
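To make the proposal concrete, here is a minimal sketch of how it could look. It is not Duke's actual code: the `Cleaner` interface below is a simplified stand-in, and `RecordView`, `CompanyNameCleaner`, `LowerCaseCleaner`, the `COUNTRY` field name and the affix lists are all hypothetical names made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a record: just enough to look up another field.
interface RecordView {
    String getValue(String property);
}

// Simplified stand-in for the cleaner interface, with the proposed overload.
interface Cleaner {
    String clean(String value);

    // Proposed addition: the default falls back to the single-argument
    // method, so existing implementations compile and behave as before.
    default String clean(String value, RecordView record) {
        return clean(value);
    }
}

// A "legacy" cleaner that only knows the one-argument method.
class LowerCaseCleaner implements Cleaner {
    @Override
    public String clean(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }
}

// A cleaner that uses the record to pick country-specific affixes.
class CompanyNameCleaner implements Cleaner {
    // Illustrative country -> affixes map (assumed values, not real data).
    private static final Map<String, List<String>> AFFIXES = new HashMap<>();
    static {
        AFFIXES.put("IT", Arrays.asList("srl", "spa"));
        AFFIXES.put("GB", Arrays.asList("ltd", "plc"));
        AFFIXES.put("US", Arrays.asList("inc", "llc"));
    }

    @Override
    public String clean(String value) {
        // Without a record, only generic normalisation is possible.
        return value.trim().toLowerCase(Locale.ROOT);
    }

    @Override
    public String clean(String value, RecordView record) {
        String name = clean(value);
        String country = record.getValue("COUNTRY"); // hypothetical field name
        List<String> affixes =
            AFFIXES.getOrDefault(country, Collections.<String>emptyList());
        for (String affix : affixes) {
            // Strip the affix only when it is a trailing, separate word.
            if (name.endsWith(" " + affix)) {
                return name.substring(0, name.length() - affix.length() - 1);
            }
        }
        return name;
    }
}
```

With this sketch, cleaning "Acme Ltd" against a record whose country is "IT" leaves the affix in place ("acme ltd"), while the same name with country "GB" is reduced to "acme". A legacy cleaner such as `LowerCaseCleaner` keeps working when called through the two-argument method, because the default simply delegates to `clean(value)`.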

larsga commented 7 years ago

Yeah, this is definitely a valid point. I've had the same issue myself.

And I agree with the fix. Extending the signature and adding a fallback so that legacy code continues to work would be ideal.

Patches welcome! :)