b-r-u opened this issue 4 years ago
I see two main issues with your approach:
What if one attribute is already at an optimum? Consider the function f(x,y) = -x² + g(y)
that is to be maximized. Assume a step size of 1 and that the optimum at x = 0 has already been found, while the optimal y is still unknown. Following your proposal, the gradients are
gradient_x = f(x + step_x, y) - f(x,y) = f(1,y) - f(0,y) = (-(1²) + g(y)) - (-0² + g(y)) = -1
gradient_y = g(y + 1) - g(y)
Ignoring the scale, the next point would be at x = -1, y = whatever. But we already know that x = 0 is the optimum! Because your proposal only checks one side (which in this case is worse), it makes an automatic assumption about the other side (namely, that it must be better, when in fact it isn't). All your new points will be worse than the one you started at.
The obvious solution is to compute the gradients using points around the current value, e.g. +val_step and -val_step. This doubles the computations needed.
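For illustration, a minimal sketch of that two-sided (central) difference; the names `f` and `val_step` are hypothetical stand-ins for the project's fitness function and step size:

```python
def central_gradient(f, x, y, val_step=1.0):
    """Approximate the gradient of f at (x, y) by sampling one point
    on each side of the current value, per attribute."""
    gradient_x = (f(x + val_step, y) - f(x - val_step, y)) / (2 * val_step)
    gradient_y = (f(x, y + val_step) - f(x, y - val_step)) / (2 * val_step)
    return gradient_x, gradient_y

# At the optimum x = 0 of f(x, y) = -x**2 + g(y), the central difference
# gives gradient_x = ((-val_step**2) - (-val_step**2)) / (2 * val_step) = 0,
# so the search no longer moves x away from the optimum.
```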
All the examples so far had only one objective. Our optimization has two objectives. Therefore, you get two different gradients for each attribute. The current implementation sidesteps this problem by checking dominance between generations.
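As an aside, a dominance check of that kind could look like this minimal sketch (maximization assumed; `dominates` is a hypothetical helper, not the project's actual code):

```python
def dominates(a, b):
    """True if fitness vector a Pareto-dominates b (maximization):
    a is at least as good in every objective and strictly better in one."""
    return all(ai >= bi for ai, bi in zip(a, b)) and \
           any(ai > bi for ai, bi in zip(a, b))

# Example: (3, 5) dominates (2, 5), but (3, 1) and (1, 3) are incomparable.
print(dominates((3, 5), (2, 5)))  # True
print(dominates((3, 1), (1, 3)))  # False
```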
A forward difference makes assumptions about the backward difference
> [...] All your new points will be worse than the one you started at.
You're right, I did not consider this case! When dealing with only one attribute, it's fine to move away from the optimum, because you will find out at the next step and hopefully not too much time is wasted computing multiples in the wrong direction. But when dealing with multiple attributes at the same time, it's important that dimensions that are already at their optimum have a gradient of zero.
Our problems are multi-objective
I'm not sure yet how to approach gradient ascent with multiple objectives. There is some research under the term "Multiple-Gradient Descent Algorithm". In my understanding it would still be beneficial to find the optimum for each objective, as these points are guaranteed to be on the Pareto front.
The implementation could perform a gradient ascent for each objective and maybe generate a third gradient which is a random linear interpolation between these two gradients.
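A minimal sketch of that idea, assuming two single-objective gradient estimates are already available (all names here are hypothetical):

```python
import random

def blended_gradients(grad_a, grad_b):
    """Yield the two per-objective gradients plus a random convex
    combination of them, as candidate ascent directions."""
    t = random.random()  # interpolation factor in [0, 1]
    blend = tuple(t * ga + (1 - t) * gb for ga, gb in zip(grad_a, grad_b))
    return [grad_a, grad_b, blend]
```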
There are two related issues with the current gradient ascent implementation.
1. Attributes are assumed to be independent
The documentation states that attributes are assumed to be independent and that they are thus optimized independently. But if two or more attributes are dependent, this approach will likely fail to find the local optimum.
Consider a function like f(x,y) = (x-y)^2 + (0.5*(x+y))^2, whose optimum (here a minimum) is at x=0, y=0. Starting from the lower-left quadrant, the current approach will also end up in the lower-left quadrant and not reach that optimum.
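To make this concrete, here is a minimal sketch (my own illustration with hypothetical helper names, not the project's code) of optimizing each attribute independently on that function:

```python
def f(x, y):
    # coupled function from the example above (to be minimized)
    return (x - y) ** 2 + (0.5 * (x + y)) ** 2

def optimize_attr(f1d, start, step=0.1, max_iters=1000):
    """Greedy 1-D search: step toward whichever neighbor improves,
    stop when neither neighbor is better than the current value."""
    current = start
    for _ in range(max_iters):
        best = min((current, current - step, current + step), key=f1d)
        if best == current:
            break
        current = best
    return current

x, y = -4.0, -4.0                        # start in the lower-left quadrant
x = optimize_attr(lambda v: f(v, y), x)  # optimize x with y held fixed
y = optimize_attr(lambda v: f(x, v), y)  # then y with the new x held fixed
print(x, y)  # ends around (-2.4, -1.4): still the lower-left quadrant
```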
2. Central difference is expensive
The current implementation uses a central difference to approximate the gradient, which takes two samples per attribute, one from each side of the current point. While optimizing a single attribute, one of these two samples is likely cached from the previous step and does not have to be computed again. But when changing more than one attribute (as I propose), new samples are unlikely to be cached.
A forward/backward difference only needs one sample per attribute in addition to the current point, i.e. N + 1 samples for N attributes instead of 2N.
A proposal
The gradient at point (x, y) with the fitness function f could be computed by using the forward difference for each attribute independently (or by switching to the backward difference if a value would be out of bounds):

gradient_x = f(x + step_x, y) - f(x, y)
gradient_y = f(x, y + step_y) - f(x, y)

New individuals can now be generated along the direction of the gradient by first normalizing the gradient and then adding multiples of the gradient to the current point.
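A minimal sketch of this proposal, assuming simple box bounds per attribute; the names (`f`, `bounds`, `step`, `multiples`) are hypothetical, not the project's actual API:

```python
import math

def forward_gradient(f, x, y, bounds, step=1.0):
    """Forward difference per attribute, falling back to the backward
    difference when the forward sample would leave the bounds.
    Assumes each interval is wider than step."""
    (x_min, x_max), (y_min, y_max) = bounds
    base = f(x, y)
    if x + step <= x_max:
        gx = f(x + step, y) - base
    else:
        gx = base - f(x - step, y)
    if y + step <= y_max:
        gy = f(x, y + step) - base
    else:
        gy = base - f(x, y - step)
    return gx, gy

def generate_individuals(f, x, y, bounds, step=1.0, multiples=(1, 2, 4)):
    """New individuals along the normalized gradient direction."""
    gx, gy = forward_gradient(f, x, y, bounds, step)
    norm = math.hypot(gx, gy)
    if norm == 0.0:  # zero gradient: already at a (local) optimum
        return []
    ux, uy = gx / norm, gy / norm
    return [(x + m * ux, y + m * uy) for m in multiples]
```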