larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0
615 stars 193 forks source link

Try optimizing configurations with hill climbing #159

Open larsga opened 10 years ago

larsga commented 10 years ago

If we don't try to find the comparators to use in configurations, hill climbing might work much faster than genetic algorithms.

Nicolas Maisonneuve suggests: "by only searching high , low parameters, I am wondering if, in such condition, the space is not convex (local optimal = global optimal), meaning that basic optimisation approach like Hill climbing [1,2] will converge much faster than GA to find the best arguments.

(the pb is that the comparator parameter has no notion of order/distance while high , low pb haves a notion of distance that is useful e.g. if with p=0.5 the global result is bad but p=0.6 it produces better results, trying p=0.7 might be interesting, but not 0.4. such heuristic is not possible for the comparators.

[1] http://en.wikipedia.org/wiki/Hill_climbing [2] http://aima-java.googlecode.com/svn/trunk/aima-core/src/main/java/aima/core/search/local/HillClimbingSearch.java"

larsga commented 10 years ago

I tried it now, and it works beautifully.

By default, the genetic algorithm takes 10,000 evaluations to complete.

The hill climber, starting from pretty decent manual configurations, took 93 evaluations to improve the countries.xml case from 0.952 to 0.962. That’s more than 100 times faster, and it could be made even faster by avoiding reevaluating configurations we’ve seen before.

In the more complex cityhotels case, which has been very carefully manually tuned, it took 363 evaluations to go from 0.98 to 0.988.

This is extremely interesting, and obviously has to be implemented.

lidya-cota commented 10 years ago

That sounds amazing! I consider that It would be good if users can choose between using genetic algorithm or hill climbing tuning.