larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

"NO MATCH FOR" evident duplicates #222

Eugeneifes closed 8 years ago

Eugeneifes commented 8 years ago

I have evident duplicates in my database, but Duke constantly returns "NO MATCH FOR".

I also tried running Duke in debug mode with the evident duplicates as input. Duke shows that these two records are the same with high probability (as I expected, overall probability = 0.7), which is higher than my threshold.

I should mention that my database has many columns (250 attributes) and most values are missing (no value, empty). According to the documentation at https://github.com/larsga/Duke/wiki/HowItWorks, these fields should be skipped and have no impact on the final probability:

Properties that lack a value in one of the records are ignored.
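
To make the quoted rule concrete, here is a minimal Java sketch of how skipping missing values keeps a sparse record pair at the same overall probability as the same pair with only its filled fields. This is not Duke's actual code: the class and method names are hypothetical, the comparator is plain string equality, and the sketch assumes a naive-Bayes-style combination starting from a neutral 0.5 with made-up low/high values of 0.3 and 0.8.

```java
import java.util.Arrays;

class SparseBayesSketch {
    // Naive Bayes combination of two probabilities (an assumption of this sketch).
    static double combine(double p1, double p2) {
        return (p1 * p2) / ((p1 * p2) + ((1.0 - p1) * (1.0 - p2)));
    }

    static boolean isMissing(String v) {
        return v == null || v.isEmpty();
    }

    // Overall probability for two records given as parallel arrays of values.
    // A property that is missing in either record is skipped entirely.
    static double overall(String[] r1, String[] r2, double low, double high) {
        double prob = 0.5; // neutral starting point
        for (int i = 0; i < r1.length; i++) {
            if (isMissing(r1[i]) || isMissing(r2[i])) {
                continue; // ignored: contributes nothing to the result
            }
            double p = r1[i].equals(r2[i]) ? high : low;
            prob = combine(prob, p);
        }
        return prob;
    }

    public static void main(String[] args) {
        String[] dense1 = {"acme", "oslo"};
        String[] dense2 = {"acme", "oslo"};

        // The same two filled values, padded to 250 columns with empty strings.
        String[] sparse1 = new String[250];
        String[] sparse2 = new String[250];
        Arrays.fill(sparse1, "");
        Arrays.fill(sparse2, "");
        sparse1[0] = "acme"; sparse2[0] = "acme";
        sparse1[1] = "oslo"; sparse2[1] = "oslo";

        // Both lines print the same value, because empty columns are skipped.
        System.out.println(overall(dense1, dense2, 0.3, 0.8));
        System.out.println(overall(sparse1, sparse2, 0.3, 0.8));
    }
}
```

Under these assumptions, the 248 empty columns cannot lower the score, so a "NO MATCH FOR" result on sparse data points to something other than the probability calculation.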

I also tried running Duke with only the filled data (about 2-5 filled fields instead of 250 with missing values), and Duke works well. So I conclude that Duke assigns weight to the missing fields, although according to the documentation this should not happen. I tried running Duke in both deduplication and record linkage modes, but it didn't help.

How should I run Duke to make it show me the duplicates in my sparse dataset?

larsga commented 8 years ago

This might have to do with what database (backend) you're using. Is Duke searching only fields which have no values? Try setting lookup=true on the relevant properties as described at the bottom here https://github.com/larsga/Duke/wiki/XMLConfig to see if that helps.

Eugeneifes commented 8 years ago

Thanks a lot! Setting lookup=true/false really solves my issue! Was I right when I guessed that empty fields bring some weight to the overall probability? If so, I still can't understand why empty fields aren't skipped automatically.

larsga commented 8 years ago

Was I right when I guessed that empty fields bring some weight to the overall probability?

I hope not. That would be a bug. Seriously, I don't see any reason to assume that.

If so, I still can't understand why empty fields aren't skipped automatically.

Matching proceeds in two steps: (1) get candidate records from database, (2) match properties. What happened here was that Duke never found any candidates to match, so the property matching never happened at all.
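
The two steps can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not Duke's backend: `TwoStepMatchSketch`, `findCandidates`, `match`, and the record fields are all hypothetical names, candidate lookup is exact equality on the chosen lookup properties, and step 2 is reduced to comparing a single property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStepMatchSketch {
    // Build a record from alternating key/value arguments.
    static Map<String, String> record(String... kv) {
        Map<String, String> r = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            r.put(kv[i], kv[i + 1]);
        }
        return r;
    }

    // Step 1: fetch candidates, searching only on the given lookup properties.
    // An empty query value cannot find anything.
    static List<Map<String, String>> findCandidates(Map<String, String> query,
            List<Map<String, String>> database, List<String> lookupProps) {
        List<Map<String, String>> candidates = new ArrayList<>();
        for (Map<String, String> rec : database) {
            if (rec == query) {
                continue; // do not match a record against itself
            }
            for (String prop : lookupProps) {
                String q = query.get(prop);
                if (q != null && !q.isEmpty() && q.equals(rec.get(prop))) {
                    candidates.add(rec);
                    break;
                }
            }
        }
        return candidates;
    }

    // Step 2: compare properties, but only for the candidates from step 1.
    static List<Map<String, String>> match(Map<String, String> query,
            List<Map<String, String>> database, List<String> lookupProps,
            String compareProp) {
        List<Map<String, String>> matches = new ArrayList<>();
        for (Map<String, String> cand : findCandidates(query, database, lookupProps)) {
            String q = query.get(compareProp);
            if (q != null && q.equals(cand.get(compareProp))) {
                matches.add(cand);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> a = record("id", "1", "name", "acme", "fax", "");
        Map<String, String> b = record("id", "2", "name", "acme", "fax", "");
        List<Map<String, String>> db = Arrays.asList(a, b);

        // Lookup on an empty property: no candidates, so step 2 never runs.
        System.out.println(match(a, db, Arrays.asList("fax"), "name").isEmpty()
                ? "NO MATCH FOR " + a.get("id") : "MATCH");
        // Lookup on a filled property: the duplicate becomes a candidate and matches.
        System.out.println(match(a, db, Arrays.asList("name"), "name").size());
    }
}
```

In this toy model the two records are obvious duplicates either way; the only thing that changes between the two calls is which property is used to find candidates, which mirrors why changing the lookup setting resolved the issue without any change to the probabilities.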

Eugeneifes commented 8 years ago

Now I see! Thank you again

larsga commented 8 years ago

Np. Does this mean we can close the issue?

Eugeneifes commented 8 years ago

Sure