icaros-usc / pyribs

A bare-bones Python library for quality diversity optimization.
https://pyribs.org
MIT License

[QUESTION] Use of Multiple Emitters of the Same Type vs Single Emitter with Larger Population #298

Closed LemonPi closed 1 year ago

LemonPi commented 1 year ago

I am currently working on a problem where I have formulated a differentiable cost function. So far, I've used CMA-ME and CMA-MEGA (EvolutionStrategyEmitter and GradientArborescenceEmitter, respectively) with roughly similar results. I re-run the methods as I accumulate more information about the environment after each execution step (of a robot), seeding the new archive with the solutions (elites) from the previous archive across all bins. A new archive is required because 1. the archive size might change and 2. the archive values are invalid after the new execution step.

In theory, having gradient knowledge should accelerate exploration and lead to better elites being found more quickly, but this doesn't seem to be the case (although I am currently only monitoring the total solution quality after each step, rather than metrics such as the QD score during each optimization iteration). It could be that both methods already find close-to-optimal solutions in the given time (1000 optimization iterations) - I will need to monitor the QD scores during the optimization to comment on this.

Currently, I am using a single emitter in both the CMA-ME and the CMA-MEGA case. For CMA-ME I have an EvolutionStrategyEmitter with batch size 30, while for CMA-MEGA I have a GradientArborescenceEmitter with batch size 29. What is the recommendation regarding the number of emitters and their batch size? I have a 2D archive with 40 bins in each dimension (the solution space is 9D).

btjanaka commented 1 year ago

Hi @LemonPi, for CMA-ME and CMA-MAE, we usually run with 5-15 emitters each with batch size 30-40. I think 5 emitters with batch size 30 would be a good starting point for you. It's useful to have multiple emitters so that they can explore multiple parts of the archive at once. For CMA-MEGA, we have tended to use just one emitter in the past because we assume gradient computation is expensive (each emitter entails an extra gradient computation). If your gradient computation is cheap, trying more emitters would be a good way to go; again, 5 emitters with batch size 30 would be a good place to start.

LemonPi commented 1 year ago

@btjanaka Thanks for the advice! Testing with multiple emitters now (performance with a single emitter is already quite good). I'm not really seeing much improvement from the GradientArborescenceEmitter (despite it being quite a bit slower) - can you confirm that I'm feeding it the right gradient? My cost function self._f(x) is differentiable:

        solutions = self.scheduler.ask_dqd()
        bcs = self._measure(solutions)
        # evaluate the models and record the objective and behavior;
        # note that objective = -cost
        x = ensure_tensor(self.device, self.dtype, solutions)
        x.requires_grad = True
        cost = self._f(x)
        cost.mean().backward()
        objective_grad = -x.grad.cpu().numpy()
        objective = -cost.detach().cpu().numpy()
        # measure (behavior) gradient: grad of bcs with respect to x
        behavior_grad = ...
        # stack objective and measure gradients into one jacobian per
        # solution (ask_dqd returns one solution per DQD emitter)
        jacobian = np.concatenate((objective_grad.reshape(self.num_emitters, 1, -1), behavior_grad), axis=1)
        self.scheduler.tell_dqd(objective, bcs, jacobian)
        # do the non-DQD update with scheduler.ask() and scheduler.tell()
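
As a shape sanity check, the concatenation above produces one jacobian of shape (1 + num_measures, solution_dim) per DQD solution, stacked along the first axis. A pure-numpy sketch with made-up sizes (3 emitters, 2 measures, 9D solutions):

```python
import numpy as np

num_emitters, num_measures, solution_dim = 3, 2, 9

# Placeholder gradients: one objective gradient and one gradient per
# measure for each of the num_emitters DQD solutions.
objective_grad = np.random.rand(num_emitters, solution_dim)
behavior_grad = np.random.rand(num_emitters, num_measures, solution_dim)

# Same concatenation as in the snippet above: the objective gradient
# becomes row 0 of each solution's jacobian, followed by the measure rows.
jacobian = np.concatenate(
    (objective_grad.reshape(num_emitters, 1, -1), behavior_grad), axis=1
)
```

The resulting `jacobian` has shape `(num_emitters, 1 + num_measures, solution_dim)`, i.e. `(3, 3, 9)` here.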

btjanaka commented 1 year ago

Your function looks correct. The part where you reshape the jacobian also looks correct -- we also check the shape in tell_dqd, so it would raise an error if your reshaping/concatenation were incorrect.

I've never used more than one GradientArborescenceEmitter myself, so it's hard for me to say if doing so would actually lead to an improvement -- it's entirely possible that this just won't help anything.

Regarding slowness, gradient computations are typically slow, which is why we only use one emitter. It looks like you're using autodiff, which would probably be slower with more solutions.

How much coverage are you seeing, and what is the maximum possible coverage for your problem? If it's already pretty close to maximum, there may not be much we can do.

Another idea is to use CMA-MAE -- are you familiar with that algorithm already?

LemonPi commented 1 year ago

I haven't read about that variant yet but will do so. The goal of my problem isn't actually to explore the whole space, but to ensure that sufficiently high-quality yet diverse solutions are available (from the whole archive, I'll select, say, the 30 best solutions). The coverage surprisingly vacillates between fully covering the archive and being highly localized between execution steps - I'm not really sure what's going on there (examples of this in the pictures below).

One advantage of the method we're developing is that it's differentiable, so we'd like to use the gradient for exploration, but maybe it's just not very useful in this case. Direct gradient descent does lead to good solutions (that are often too close together); we're using a quality diversity method to enforce diversity.
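
The "select the 30 best solutions" step can be sketched in plain numpy, assuming the elites' objectives and solutions have already been extracted from the archive (the arrays below are random placeholders standing in for the extracted data):

```python
import numpy as np

# Placeholder archive contents: one objective and one 9D solution per elite.
rng = np.random.default_rng(0)
objectives = rng.random(200)
solutions = rng.random((200, 9))

# Indices of the k elites with the highest objectives, best first.
k = 30
top_idx = np.argsort(objectives)[::-1][:k]
top_solutions = solutions[top_idx]
```

Because the archive already enforces that each elite occupies a distinct bin, the top-k set retains the archive's diversity guarantee while maximizing quality.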

btjanaka commented 1 year ago

I see. In that case, CMA-MAEGA may help you -- the idea of CMA-MAE is that you can balance between always favoring diversity (which is what CMA-ME does) and always favoring optimization (which is what CMA-ES does); CMA-MAE has an archive learning rate parameter which lets you choose an algorithm anywhere between CMA-ES and CMA-ME. Hopefully this will let you perform more optimization than you're currently seeing with CMA-ME. We have a preliminary CMA-MAE tutorial here: https://docs.pyribs.org/en/latest/tutorials/cma_mae.html and I'm also working on one which uses Lunar Lander.