wbrickner opened this issue 2 years ago
Generally, computing the eigendecomposition is the most expensive part of the algorithm (~30% to ~50% of the total), while the matrix multiplications usually take far less (~10%). However, this changes when using very large population sizes, which is frequently done by the IPOP and BIPOP restart strategies. With large population sizes, the various matrix multiplications can take up about a third of the running time.
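To make the two hot paths concrete, here is a minimal, self-contained sketch using nalgebra (which cmaes builds on); the names and shapes are illustrative only, not the crate's actual internals:

```rust
use nalgebra::{DMatrix, DVector};

fn main() {
    let n = 30; // problem dimension

    // A symmetric positive-definite stand-in for the covariance matrix C.
    // (DMatrix::new_random needs nalgebra's rand support enabled.)
    let a = DMatrix::<f64>::new_random(n, n);
    let c = &a * a.transpose() + DMatrix::<f64>::identity(n, n);

    // Eigendecomposition C = B * D^2 * B^T: the O(n^3) step that
    // dominates the running time at default population sizes.
    let eigen = c.symmetric_eigen();
    let b = eigen.eigenvectors;
    let d = DMatrix::from_diagonal(&eigen.eigenvalues.map(f64::sqrt));

    // Sampling: one matrix-vector product per offspring. With an
    // IPOP/BIPOP-inflated population, this loop starts to dominate.
    let lambda = 1000;
    let bd = &b * &d;
    let samples: Vec<DVector<f64>> = (0..lambda)
        .map(|_| &bd * DVector::<f64>::new_random(n))
        .collect();
    println!("generated {} samples", samples.len());
}
```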
These numbers only include the time spent in CMA-ES itself and do not account for a potentially expensive objective function. They can also vary greatly with different problem dimensions, and they were only measured on my machine using the default settings. There is a setting to run most of the matrix multiplications in parallel, which can be a major performance boost with very large population sizes (see the sketch after the flamegraph links below). If you want to explore the distribution of running time further, you can check out the flamegraphs I generated on the simple example with cargo-flamegraph, or generate your own (you have to download them and open them in your browser):
flamegraph_10d
flamegraph_30d
flamegraph_30d_large_popsize
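That parallel setting essentially spreads the per-individual products across threads. A hedged sketch of the idea using rayon (not the crate's actual implementation):

```rust
use nalgebra::{DMatrix, DVector};
use rayon::prelude::*;

/// Illustrative only: sample `lambda` offspring in parallel by computing
/// y_k = BD * z_k on separate threads. `bd` stands in for the product of
/// the eigenvector matrix B and the diagonal scaling D.
fn sample_population(bd: &DMatrix<f64>, lambda: usize) -> Vec<DVector<f64>> {
    (0..lambda)
        .into_par_iter()
        .map(|_| {
            let z = DVector::<f64>::new_random(bd.ncols());
            bd * z
        })
        .collect()
}
```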
Offloading the matrix multiplications to the GPU could be a large performance boost, and it looks like there are a fair number of other optimization opportunities as well (mostly reducing allocations, and perhaps swapping the RNG out for a faster one).
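On the RNG point, a hedged example of what a swap could look like using the rand crate's SmallRng (this assumes the sampling code is generic over rand::Rng, and needs rand's small_rng feature):

```rust
use rand::rngs::SmallRng;
use rand::{Rng, SeedableRng};

fn main() {
    // SmallRng trades statistical/cryptographic guarantees for speed,
    // which is acceptable for the N(0, 1) draws CMA-ES needs.
    let mut rng = SmallRng::seed_from_u64(42);

    // rand_distr::StandardNormal would be used for the actual Gaussian
    // samples; a plain uniform gen() is shown here for brevity.
    let z: Vec<f64> = (0..30).map(|_| rng.gen::<f64>()).collect();
    println!("first draws: {:?}", &z[..3]);
}
```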
Ok, so I'm going to use this issue to continue our discussion over from eant2, now that my work primarily concerns GPU acceleration of CMA-ES.
I'm almost finished, and there should only be a couple of copies that are absolutely required if we are not allowed to assume the user's objective function also runs on the GPU.
The GPU programming part is actually pretty straightforward, and so is much of the refactor needed to keep all memory on the GPU (see the sketch below).
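Roughly, the device-resident flow looks like this in arrayfire-rust; this is an illustrative sketch, not my actual branch:

```rust
use arrayfire::{matmul, randn, Array, Dim4, MatProp};

fn main() {
    let (n, lambda) = (30u64, 1000u64);

    // Everything below stays in device memory; no host copies happen
    // until results are explicitly read back.
    let bd = randn::<f32>(Dim4::new(&[n, n, 1, 1])); // stand-in for B*D
    let z = randn::<f32>(Dim4::new(&[n, lambda, 1, 1])); // all z_k at once

    // One batched GEMM replaces lambda separate matrix-vector products.
    let y: Array<f32> = matmul(&bd, &z, MatProp::NONE, MatProp::NONE);

    // Copy back to the host only because the user's objective function
    // can't be assumed to consume device memory directly.
    let mut host = vec![0f32; (n * lambda) as usize];
    y.host(&mut host);
}
```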
However, supporting even the changes I've made in isolation (let alone as an invisible backend) has turned into an absolute mess: five generic parameters everywhere, super long trait bounds in hundreds of locations, very messy. It is also causing me to miss optimization opportunities; because everything is so siloed / compartmentalized, it's hard to reuse memory that has already been transferred, etc.
I think we should consider simplifying how the cmaes crate is implemented. I think we could redesign it with less total code, probably fewer specialized structs, etc.
Another option is to fork into two crates which maintain the same public API, cmaes_cpu and cmaes_arrayfire or something, and then cmaes just re-exports everything from one of the two based on a feature flag or something.
I don't think I can ever merge what I have with what exists; every single struct needs to be made generic over the numeric type, etc., and all of the internals are touched by the changes.
Making the crate generic over numeric type can be done separately, as it's a feature I'm interested in having even for the pure CPU version.
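Something like bounding on nalgebra's RealField is what I have in mind there; just a sketch, not a final design:

```rust
use nalgebra::{DMatrix, DVector, RealField};

/// Sketch of a state struct generic over the scalar type, so f32 can
/// back a GPU path while f64 stays the default for the CPU path.
struct CmaesState<F: RealField> {
    mean: DVector<F>,
    cov: DMatrix<F>,
    sigma: F,
}

impl<F: RealField> CmaesState<F> {
    fn new(dim: usize, sigma: F) -> Self {
        Self {
            mean: DVector::zeros(dim),
            cov: DMatrix::identity(dim, dim),
            sigma,
        }
    }
}

// Both instantiations compile from the same code:
// let cpu: CmaesState<f64> = CmaesState::new(30, 0.5);
// let gpu: CmaesState<f32> = CmaesState::new(30, 0.5f32);
```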
Having two separate crates could be a good solution. I'm not sure I want to change the design of the existing code, but if the ArrayFire code requires too many changes, it can simply be moved into a separate module or crate instead. It should be fairly simple to abstract over them as well (see the sketch below), so it would be nearly seamless for the end user.
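Something along these lines in the facade crate's lib.rs, assuming hypothetical Cargo features named cpu and arrayfire that gate the two backend dependencies:

```rust
// lib.rs of the `cmaes` facade crate (sketch). Assumes Cargo.toml
// declares optional dependencies on cmaes_cpu and cmaes_arrayfire,
// gated behind `cpu` (default) and `arrayfire` features.

#[cfg(feature = "arrayfire")]
pub use cmaes_arrayfire::*;

#[cfg(all(feature = "cpu", not(feature = "arrayfire")))]
pub use cmaes_cpu::*;
```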
Hello! It's me again!
I've been optimizing eant2, mostly just minor stuff like amortizing allocations, increasing cache hits, parallelism, etc. The real meat of the work in eant2 is done in CMA-ES land.
What portion of that is matrix multiplication & other parallel operations that would benefit from integration with rust-gpu and the like? I'm interested in running "big" problems through eant2 on my consumer machine, and I'd like to understand the distribution of time spent inside cmaes. Is it known? Do you have any information on where time is spent? The first thing that came to mind is GPU acceleration, of course. Thank you.