larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Linear value calculations, Filter by required properties, optimizatio… #210

Closed markodjurovic closed 6 years ago

markodjurovic commented 9 years ago

Added features:

sim * (high - low) + low;

Also, if this feature is turned on, the overall similarity is calculated by the formula:

retVal = prob / sumOfHighPropertyProbability;

where prob is the sum of the similarity values of all properties, and sumOfHighPropertyProbability is the sum of the high values of all properties. Enable this by setting the property:

<linearMode on="true" />
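The two formulas above can be illustrated with a small sketch. This is not the code from the pull request: the class name `LinearModeSketch` and both method names are hypothetical, and it assumes that `sim` is a comparator result in [0, 1] and that `low` and `high` are the per-property settings from the Duke configuration. The guard for a zero denominator is also an assumption.

```java
/** Hypothetical illustration of linear mode; not the PR's actual implementation. */
final class LinearModeSketch {

    /** Maps a comparator similarity linearly onto [low, high]: sim * (high - low) + low. */
    static double propertyValue(double sim, double low, double high) {
        return sim * (high - low) + low;
    }

    /**
     * Overall similarity: the sum of the per-property values divided by
     * the sum of the high values of all properties.
     */
    static double overallSimilarity(double[] sims, double[] lows, double[] highs) {
        if (sims.length != lows.length || sims.length != highs.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double prob = 0.0;
        double sumOfHighPropertyProbability = 0.0;
        for (int i = 0; i < sims.length; i++) {
            prob += propertyValue(sims[i], lows[i], highs[i]);
            sumOfHighPropertyProbability += highs[i];
        }
        // Assumption: with no properties (or all highs zero) report 0 instead of dividing by zero.
        if (sumOfHighPropertyProbability == 0.0) {
            return 0.0;
        }
        return prob / sumOfHighPropertyProbability;
    }
}
```

Under these assumptions, a record pair whose properties all compare with sim = 1 scores exactly 1, and a pair whose properties all compare with sim = 0 scores the sum of the low values divided by the sum of the high values.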

<reverseOptimization on="true" />

<treatRequiredPropertiesAsFilter on="true" />

Candidates found via properties whose lookup value is true will be filtered, so that only those candidates remain whose required properties have the same values as the required properties of the original item.
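The filtering step described above could look roughly like the following sketch. It is a simplification, not the code from the pull request: records are modelled as plain `Map<String, String>` instead of Duke's record type, each property is assumed to have a single value, and the class name `RequiredPropertyFilterSketch` and its `filter` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical illustration of treating required properties as a filter. */
final class RequiredPropertyFilterSketch {

    /**
     * Keeps only the candidates whose value for every required property
     * equals the original record's value for that property.
     */
    static List<Map<String, String>> filter(Map<String, String> original,
                                            List<Map<String, String>> candidates,
                                            Set<String> requiredProperties) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> candidate : candidates) {
            boolean allRequiredMatch = true;
            for (String name : requiredProperties) {
                // Objects.equals treats two missing (null) values as equal; that is an assumption.
                if (!Objects.equals(original.get(name), candidate.get(name))) {
                    allRequiredMatch = false;
                    break;
                }
            }
            if (allRequiredMatch) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

In this sketch the filter runs on the candidate list before any similarity is computed, so candidates that differ on a required property are never compared at all.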


In the config file, these flags should be set at the beginning.

larsga commented 9 years ago

There are a few problems with this pull request. The first is all the automated changes your IDE has made. There are so many of them that I'm having difficulties finding the actual changes. If you can get rid of this I may be able to review this pull request properly.

What is the benefit of the linear mode? Why would someone use it instead of the naive Bayes mode?

The reverse optimization sounds like it could be useful, but I couldn't find the actual code in the PR because of all the noise.

The same goes for treatRequiredPropertiesAsFilter.

If you could redo the PR I'd be interested to see this.

markodjurovic commented 9 years ago

Hi larsga, thanks for your comments. I've fixed the mess caused by the automatic code indentation.

The idea behind linear mode is that I am using Duke not only for deduplication but also for measuring general similarity between items. In linear mode I can tweak my data model more easily while keeping my comparators as simple as possible.

Best, Marko.