Some comments - Githubissues

Congratulations, qdo_yahpo and yahpo_gym are excellent projects. I am very impressed with the level of parallelization achieved.

Parallelization can be implemented at the level of fitness evaluation, as you do, or inside the optimizer itself, which is the approach implemented in https://github.com/dietmarwo/fast-cma-es (fcmaes).
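
To make the distinction concrete, here is a minimal sketch of the two styles. It is not code from qdo_yahpo or fcmaes: `sphere`, `evaluate_batch`, `local_search` and `run_parallel_optimizers` are hypothetical names, the optimizer is a trivial hill climber, and a thread pool stands in for the process pool a real setup would use.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sphere(x):
    """Toy single-solution fitness (lower is better)."""
    return sum(v * v for v in x)


def evaluate_batch(fitness, population, workers=4):
    """Style 1: the optimizer stays sequential and hands a whole batch
    of solutions to a parallel evaluator."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))


def local_search(fitness, dim, evals, seed):
    """A plain sequential hill climber calling a single-solution fitness."""
    rng = random.Random(seed)
    best_x = [rng.uniform(-1, 1) for _ in range(dim)]
    best_y = fitness(best_x)
    for _ in range(evals - 1):
        cand = [v + rng.gauss(0, 0.1) for v in best_x]
        y = fitness(cand)
        if y < best_y:
            best_x, best_y = cand, y
    return best_y, best_x


def run_parallel_optimizers(fitness, dim, evals, workers=4):
    """Style 2: parallelism lives in the optimizer layer. Each worker runs
    its own optimization loop and evaluates one solution at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(local_search, fitness, dim, evals, seed)
                   for seed in range(workers)]
        return min(f.result() for f in futures)
```

In the second style the fitness function never needs to accept more than one solution, which is the property referred to in the wall-time remark further down.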

I used/adapted your code to apply fcmaes to your benchmark problems, see https://github.com/dietmarwo/fast-cma-es/blob/master/examples/yahpo.py . fcmaes handles some things differently, so maybe it is interesting for you to compare the approaches.

- fcmaes uses a QD archive shared between parallel processes, each running either CVT MAP-Elites or an improvement emitter.
- fcmaes uses Voronoi tessellation (see CVT MAP-Elites, https://arxiv.org/abs/1610.05729).
- Instead of a Gaussian distribution, fcmaes can use simulated binary crossover (SBX) + mutation, as in NSGA-II.
- There is something similar to the "mixed" emitter mode, but fcmaes is more flexible:
  - The number of parallel processes allocated to each emitter is configurable.
  - Improvement emitters do not necessarily use CMA-ES (CR-FM-NES, DE, BiteOpt and PGPE being the current alternatives).
  - Improvement emitters can be chained (like DE -> CMA), where the following emitter is initialized with the solution from the previous one. This helps with extremely rugged fitness landscapes.
- Improvement emitters are initialized with a random solution instead of a niche elite. This seems to work better.
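
The archive and variation ideas from the list can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the fcmaes implementation: all function names are hypothetical, the CVT is approximated by a few k-means iterations on uniform samples, decision variables and descriptors are assumed to lie in [0, 1], fitness is minimized, and the mutation is the basic polynomial mutation commonly paired with SBX. The shared-memory archive and the improvement emitters are left out.

```python
import random


def nearest(centroids, behavior):
    """Index of the Voronoi cell (niche) whose centroid is closest."""
    return min(range(len(centroids)),
               key=lambda i: sum((c - b) ** 2
                                 for c, b in zip(centroids[i], behavior)))


def cvt_centroids(n_niches, dim, samples=1000, iters=10, seed=0):
    """Approximate a CVT of [0, 1]^dim with a few k-means iterations."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(samples)]
    centroids = [list(p) for p in points[:n_niches]]
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(n_niches)]
        counts = [0] * n_niches
        for p in points:
            i = nearest(centroids, p)
            counts[i] += 1
            for d in range(dim):
                sums[i][d] += p[d]
        for i in range(n_niches):
            if counts[i]:  # keep the old centroid if the cell got no samples
                centroids[i] = [s / counts[i] for s in sums[i]]
    return centroids


def archive_insert(archive, centroids, x, fitness, behavior):
    """Store x if its niche is empty or x beats the elite (minimization)."""
    i = nearest(centroids, behavior)
    if i not in archive or fitness < archive[i][0]:
        archive[i] = (fitness, x)
        return True
    return False


def sbx(p1, p2, rng, eta=15.0):
    """Simulated binary crossover; returns one child clipped to [0, 1]."""
    child = []
    for a, b in zip(p1, p2):
        u = rng.random()
        if u <= 0.5:
            beta = (2.0 * u) ** (1.0 / (eta + 1.0))
        else:
            beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
        c = 0.5 * ((1.0 + beta) * a + (1.0 - beta) * b)
        child.append(min(1.0, max(0.0, c)))
    return child


def poly_mutation(x, rng, eta=20.0, rate=0.2):
    """Basic polynomial mutation on [0, 1], applied per gene with prob. rate."""
    out = []
    for v in x:
        if rng.random() < rate:
            u = rng.random()
            if u < 0.5:
                delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
            else:
                delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
            v = min(1.0, max(0.0, v + delta))
        out.append(v)
    return out


def toy_problem(x):
    """Hypothetical task: fitness to minimize, first two genes as descriptor."""
    return sum((v - 0.5) ** 2 for v in x), x[:2]


def map_elites(evaluate, dim, n_niches=20, evals=300, seed=1):
    """MAP-Elites loop: pick two elites, apply SBX + mutation, insert child."""
    rng = random.Random(seed)
    centroids = cvt_centroids(n_niches, 2, seed=seed)
    archive = {}
    for _ in range(evals):
        if len(archive) < 2:
            x = [rng.random() for _ in range(dim)]  # bootstrap randomly
        else:
            (_, p1), (_, p2) = rng.sample(list(archive.values()), 2)
            x = poly_mutation(sbx(p1, p2, rng), rng)
        fitness, behavior = evaluate(x)
        archive_insert(archive, centroids, x, fitness, behavior)
    return archive
```

Calling `map_elites(toy_problem, dim=4)` returns a dict mapping niche index to `(fitness, solution)`. With a grid archive only `nearest` would change, which is why archives built on the two tessellations are not directly comparable.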

Results cannot be directly compared since different tessellations are used (grid vs. Voronoi), but my impression is that with both 1E5 and 1E6 evaluations per run the results are better than all alternatives tested in qdo_yahpo. The main reason seems to be that CR-FM-NES (see https://arxiv.org/abs/2201.11422) works better than CMA-ES as an emitter for the tasks benchmarked by qdo_yahpo. For other tasks the advantage of CR-FM-NES can be even larger, see my test results in https://github.com/google/evojax/pull/52 . The SBX/mutation used for MAP-Elites may be another reason.

Regarding wall time, there is only about a factor of 2 improvement compared to qdo_yahpo (tested on a 16-core AMD 5950X), but fcmaes doesn't require a fitness function that evaluates multiple solutions in parallel, since parallelization is handled inside the optimizer. Your parallelized fitness evaluator already scales very well; otherwise the difference would be much larger.

slds-lmu / qdo_yahpo, issue #1