pengowen123 / cmaes

A Rust implementation of the CMA-ES optimization algorithm.
Apache License 2.0

Utilize GPU for matrix multiplication? #7

Open · wbrickner opened this issue 2 years ago

wbrickner commented 2 years ago

Hello! It's me again!

I've been optimizing eant2, mostly minor stuff like amortizing allocations, improving cache locality, adding parallelism, etc.

The real meat of the work of eant2 is done in CMA-ES land.

What portion of that is matrix multiplication & other parallel operations that would benefit from integration with rust-gpu and the like?

I'm interested in running "big" problems through eant2 on my consumer machine, so I'd like to understand the distribution of time spent inside CMA-ES. Is it known where the time goes? GPU acceleration is of course the first thing that came to mind.

Thank you.

pengowen123 commented 2 years ago

Generally, computing the eigendecomposition is the most expensive part of the algorithm (~30% to ~50% of the total), and matrix multiplications are usually far less (~10%). However, this changes when using very large population sizes, which is frequently done by the IPOP and BIPOP restart strategies. With large population sizes, the various matrix multiplications can take up about a third of the running time.
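If you want to sanity-check those proportions on your own machine, a standalone micro-benchmark along these lines will show the gap between the two operations (this uses nalgebra; the dimension and iteration counts are arbitrary, and it is not the crate's actual code):

```rust
use nalgebra::DMatrix;
use std::time::Instant;

fn main() {
    let n = 30; // problem dimension; try 10, 30, 100, ...
    // Build a symmetric positive-definite matrix to stand in for the
    // covariance matrix (deterministic, so no RNG features are needed).
    let a = DMatrix::from_fn(n, n, |i, j| ((i * 31 + j * 17) % 13) as f64);
    let cov = &a * a.transpose() + DMatrix::identity(n, n);

    let start = Instant::now();
    for _ in 0..1000 {
        // Eigendecomposition of the symmetric covariance matrix
        let _eigen = cov.clone().symmetric_eigen();
    }
    println!("eigendecomposition: {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..1000 {
        // A matrix multiplication of the same size, for comparison
        let _prod = &cov * &cov;
    }
    println!("matrix multiply:    {:?}", start.elapsed());
}
```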

These numbers only include the time spent in CMA-ES itself and do not account for a potentially expensive objective function. They can also vary greatly with different problem dimensions, and they were only measured on my machine using the default settings. There is a setting to run most of the matrix multiplications in parallel, which can be a major performance boost with very large population sizes.

If you want to explore the distribution of running time further, you can check out the flamegraphs I generated on the simple example with cargo-flamegraph, or generate your own (you have to download them and open them in your browser):

- flamegraph_10d
- flamegraph_30d
- flamegraph_30d_large_popsize

Offloading matrix multiplications to the GPU could be a large performance boost, and it looks like there are a fair number of other opportunities for optimization as well (mostly allocations and perhaps swapping the RNG out for a faster one).
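For the RNG and allocation side, the kind of change I mean is quite small; here's a sketch (the helper and buffer reuse are hypothetical, not the crate's current internals, and SmallRng requires rand's small_rng feature):

```rust
use nalgebra::DVector;
use rand::{rngs::SmallRng, Rng, SeedableRng};

// Hypothetical inner-loop helper: fill a preallocated buffer instead of
// allocating a fresh vector every generation, and use the fast,
// non-cryptographic SmallRng instead of the default RNG.
fn sample_into(buf: &mut DVector<f64>, rng: &mut SmallRng) {
    for x in buf.iter_mut() {
        // Real sampling would draw from a normal distribution (e.g. via
        // rand_distr); a uniform draw keeps this sketch dependency-light.
        *x = rng.gen::<f64>();
    }
}

fn main() {
    let mut rng = SmallRng::seed_from_u64(42);
    let mut buf = DVector::zeros(30); // allocated once, reused each generation
    for _ in 0..10 {
        sample_into(&mut buf, &mut rng);
    }
}
```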

wbrickner commented 2 years ago

Ok, I'm going to use this issue to continue our discussion from eant2, now that my work primarily concerns GPU acceleration of CMA-ES.

I'm almost finished, and only a couple of copies should be strictly required if we're not allowed to assume the user's objective function also uses the GPU.

The GPU programming part is actually pretty straightforward, and so is much of the refactor needed to keep all memory on the GPU.
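To illustrate the memory-residency point, a toy arrayfire sketch (not the actual eant2/cmaes code; f32 to avoid requiring f64 device support):

```rust
use arrayfire::{matmul, randu, Array, Dim4, MatProp};

fn main() {
    let dims = Dim4::new(&[30, 30, 1, 1]);
    // Both operands are created directly in device memory...
    let a: Array<f32> = randu(dims);
    let b: Array<f32> = randu(dims);

    // ...and intermediates stay there, so chained operations incur no
    // host<->device transfers.
    let c = matmul(&a, &b, MatProp::NONE, MatProp::NONE);
    let d = matmul(&c, &a, MatProp::NONE, MatProp::NONE);

    // Only the final read-back crosses the bus.
    let mut out = vec![0.0f32; 30 * 30];
    d.host(&mut out);
    println!("d[0] = {}", out[0]);
}
```

The mess isn't in the GPU calls themselves; it's in making everything around them generic.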

However, supporting the changes I've made even in isolation (let alone as an invisible backend) has turned into an absolute mess: 5 generics everywhere, super long trait bounds in hundreds of locations, very messy. It's also causing me to miss optimization opportunities; because everything is so siloed/compartmentalized, it's hard to reuse memory that has already been transferred, etc.

I think we should consider simplifying how the cmaes crate is implemented. We could probably redesign it with less total code, fewer specialized structs, etc.

wbrickner commented 2 years ago

Another option is to fork into two crates that maintain the same public API, cmaes_cpu and cmaes_arrayfire or something, and then have cmaes just re-export everything from one of the two based on a feature flag.
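A rough sketch of that facade (the gpu feature name is a placeholder):

```rust
// lib.rs of the hypothetical `cmaes` facade crate: re-export one backend,
// chosen at compile time, so downstream code always sees a single API.
#[cfg(feature = "gpu")]
pub use cmaes_arrayfire::*;

#[cfg(not(feature = "gpu"))]
pub use cmaes_cpu::*;
```

Downstream, building with `--features gpu` would flip the backend without any code changes.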

I don't think I can ever merge what I have with what exists; every single struct needs to be made generic over the numeric type, etc., and all of the internals are touched by the changes.

pengowen123 commented 2 years ago

Making the crate generic over the numeric type can be done separately, as it's a feature I'm interested in having even for the pure-CPU version.
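Something along these lines, e.g. bounding on nalgebra's RealField (the struct and fields here are just illustrative, not the actual internals):

```rust
use nalgebra::{DVector, RealField};

// Illustrative only: parameterize the state over any real scalar
// (f32, f64, ...) instead of hard-coding f64.
pub struct State<F: RealField> {
    mean: DVector<F>,
    step_size: F,
}

impl<F: RealField> State<F> {
    pub fn new(dim: usize) -> Self {
        State {
            mean: DVector::zeros(dim),
            step_size: F::one(),
        }
    }
}
```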

Having two separate crates could be a good solution. I'm not sure I want to change the design of the existing code, but if the ArrayFire code requires too many changes, it can simply be moved into a separate module or crate instead. It should be fairly simple to abstract over them as well, so it would be nearly seamless for the end-user.