Linear value calculations, Filter by required properties, optimizatio…

markodjurovic commented 9 years ago

Added features:

Ability to compare property values linearly (if enabled in the config), using the formula:

sim * (high - low) + low;

Also, if this feature is turned on, the overall similarity is calculated by the formula:

retVal = prob / sumOfHighPropertyProbability;

where prob is the sum of the similarity values of all properties, and sumOfHighPropertyProbability is the sum of the high values of all properties. Enable this by setting the property:

<linearMode on="true" />
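A minimal sketch of the two linear-mode formulas above, for illustration only. The class and method names (`LinearModeSketch`, `linearValue`, `overall`) are hypothetical and not taken from the PR; only the formulas and the variable names `prob` and `sumOfHighPropertyProbability` come from the description.

```java
// Hypothetical illustration of the linear-mode formulas; not the PR's actual code.
class LinearModeSketch {

    /** Maps a comparator similarity in [0, 1] linearly onto the range [low, high]. */
    static double linearValue(double sim, double high, double low) {
        return sim * (high - low) + low;
    }

    /** Overall similarity: sum of per-property values divided by the sum of the high values. */
    static double overall(double[] sims, double[] highs, double[] lows) {
        double prob = 0.0;
        double sumOfHighPropertyProbability = 0.0;
        for (int i = 0; i < sims.length; i++) {
            prob += linearValue(sims[i], highs[i], lows[i]);
            sumOfHighPropertyProbability += highs[i];
        }
        return prob / sumOfHighPropertyProbability;
    }

    public static void main(String[] args) {
        // Two properties, each with high = 0.75 and low = 0.25.
        double[] highs = {0.75, 0.75};
        double[] lows = {0.25, 0.25};
        double[] sims = {1.0, 0.5};
        System.out.println(overall(sims, highs, lows));
    }
}
```

With these formulas, a pair whose properties all match perfectly (sim = 1 everywhere) scores exactly 1, and a pair with no similarity at all scores the sum of the low values divided by the sum of the high values.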

Ability to set in the config that the item2 -> item1 calculation is skipped if item1 -> item2 has already been calculated. Instead of n^2 calculations there are (n * (n-1)) / 2 (the binomial coefficient of n and 2). Enable this by setting the property:

<reverseOptimization on="true" />
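To make the saving concrete, here is a small sketch that counts pairwise comparisons with and without the optimization. The class and method names are hypothetical, and the loop is a simplification: it compares every item with every other item, whereas the real matching presumably only compares against candidates found by lookup. The sketch also skips self-comparisons, so the unoptimized count is n * (n-1) rather than the n^2 quoted above.

```java
// Hypothetical illustration of skipping item2 -> item1 once item1 -> item2 is done.
class ReverseOptimizationSketch {

    /** Counts comparisons over n items; with the optimization only pairs i < j are visited. */
    static long countComparisons(int n, boolean reverseOptimization) {
        long count = 0;
        for (int i = 0; i < n; i++) {
            // With the optimization on, start after i so each unordered pair is seen once.
            int start = reverseOptimization ? i + 1 : 0;
            for (int j = start; j < n; j++) {
                if (i == j) {
                    continue; // never compare an item with itself
                }
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("without optimization: " + countComparisons(n, false));
        System.out.println("with optimization:    " + countComparisons(n, true));
    }
}
```

This is only valid when the similarity is symmetric, that is, when comparing item1 to item2 gives the same result as comparing item2 to item1.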

Different behaviour for properties whose lookup value is set to "required". So if the config contains:

<treatRequiredPropertiesAsFilter on="true" />

candidates found by properties whose lookup value is true will be filtered, and only those candidates remain whose required properties have the same values as the required properties of the original item.
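The filtering step can be sketched roughly as follows. This is an assumption-laden illustration: records are modelled as plain `Map<String, String>` objects rather than Duke's own record type, and `RequiredFilterSketch` and `filterCandidates` are hypothetical names, not code from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration of treating required properties as a filter.
class RequiredFilterSketch {

    /** Keeps only candidates whose required properties equal those of the original item. */
    static List<Map<String, String>> filterCandidates(
            Map<String, String> original,
            List<Map<String, String>> candidates,
            List<String> requiredProperties) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> candidate : candidates) {
            boolean matches = true;
            for (String property : requiredProperties) {
                // Objects.equals also handles a property missing on both sides.
                if (!Objects.equals(original.get(property), candidate.get(property))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> original = Map.of("country", "NO", "name", "Oslo");
        List<Map<String, String>> candidates = List.of(
                Map.of("country", "NO", "name", "Olso"),
                Map.of("country", "SE", "name", "Oslo"));
        System.out.println(filterCandidates(original, candidates, List.of("country")));
    }
}
```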

In the config file these flags should be set at the beginning.

larsga commented 9 years ago

There are a few problems with this pull request. The first is all the automated changes your IDE has made. There are so many of them that I'm having difficulties finding the actual changes. If you can get rid of this I may be able to review this pull request properly.

What is the benefit of the linear mode? Why would someone use it instead of the naive Bayes mode?

The reverse optimization sounds like it could be useful, but I couldn't find the actual code in the PR because of all the noise.

The same goes for treatRequiredPropertiesAsFilter.

If you could redo the PR I'd be interested to see this.

markodjurovic commented 9 years ago

Hi larsga, thanks for your comments. I've fixed the mess caused by the automatic code indentation.

The idea behind linear mode is that I am using Duke not just for deduplication purposes but also for general similarity between items. In linear mode I can tweak my data model more easily, keeping my comparators as simple as possible.

Best, Marko.

larsga / Duke

Linear value calculations, Filter by required properties, optimizatio… #210