This is strange indeed. What is the panic message for each one? I'll try to reproduce the issue locally in the meantime.
Alright, I tracked down the issue to the `identify_recurrence` function. It's calling `enumerate` before deduplication, resulting in invalid indices into `self.persistence`. All this work is already being done by `cge` internally though, so I'll just expose it through a `Network` method for you so you don't have to duplicate any logic.
`cge` 0.1.1 is released and has a `Network::recurrent_state_ids` method to replace the logic in `identify_recurrence`. The first neuron ID in the list should correspond to `self.persistence[0]`, the second to `self.persistence[1]`, and so on, which you can still build a `HashMap` for.
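A minimal sketch of building that map, assuming `recurrent_state_ids` yields the neuron IDs in that order (the exact ID and return types are elided here):

```rust
use std::collections::HashMap;

// Map each recurrent neuron ID to its index in `self.persistence`,
// relying on `recurrent_state_ids` returning the IDs in matching order.
let recurrent_index: HashMap<_, usize> = network
    .recurrent_state_ids()
    .into_iter()
    .enumerate()
    .map(|(index, id)| (id, index))
    .collect();
```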
It seems that during the refactor, repeated `pop` calls were replaced by `drain` when popping items off the stack, which reversed the order in which they are summed. If you want to match `cge` exactly, you can just reverse the order of your sum statements as well.
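A standalone illustration of that order difference (not the actual cge code):

```rust
fn main() {
    // Repeated `pop` walks the Vec from the back...
    let mut stack = vec![1.0_f64, 2.0, 4.0];
    let mut popped = Vec::new();
    while let Some(x) = stack.pop() {
        popped.push(x);
    }
    assert_eq!(popped, vec![4.0, 2.0, 1.0]);

    // ...while `drain(..)` walks it from the front, so a refactor from one to
    // the other reverses the order in which floating-point terms are accumulated.
    let mut stack = vec![1.0_f64, 2.0, 4.0];
    let drained: Vec<f64> = stack.drain(..).collect();
    assert_eq!(drained, vec![1.0, 2.0, 4.0]);
}
```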
Oh wow, yes! Thank you
That solved part of the problem, but the `with_extra_data_v1.cge` network still gives trouble.
The reason for this is that the default backend used by `const_cge` is `libm`, and the `tanh` implementation is not bit-identical to the `std` version, which `cge` uses.
I consider this resolved.
Hello, while testing with the randomly generated networks, I get an issue with my input sizing. CGE analysis says the length needs to be 9, but index 9 is accessed, resulting in a panic.
```rust
let input_count = {
    // The number of inputs to a network is the number of unique input IDs
    // found among all Input genes in the genome.
    let mut input_ids = HashSet::new();
    network
        .genome()
        .iter()
        .filter_map(|g| match g {
            Gene::Input(i) => Some(i.id()),
            _ => None,
        })
        .for_each(|id| { input_ids.insert(id.as_usize()); });
    input_ids.len()
};
```
As you can see, `inputs[9]` is accessed, which is incorrect.
As suspected, if you grep for `inputs[`, isolate the access, and sort:
There is a conspicuous lack of access to `inputs[5]`. The input IDs are not compact. Is this possible behavior? Have I messed something up in codegen? Is the CGE file invalid?
Essentially, are the input IDs supposed to be compact? Is there any harm in me compactifying them? Does this mean input number 5 is not connected to the interior of the network, and I should respect the original IDs, taking the highest ID as the length, rather than the number of unique IDs?
Thank you
The input IDs are not compact, so the network is supposed to have 10 inputs. If you want to guarantee that `const_cge` matches `cge` behavior, you can just get the number of inputs to the network by calling `Network::num_inputs`.
The reason for this is that you might have a network that logically has some number of inputs, but just lacks connections to one of them. In this case, compacting the input IDs would be possible, but it would change the behavior of the network and so isn't currently done.
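A minimal sketch of the corrected sizing using that method (assuming it returns the count directly; the exact signature may differ):

```rust
// Size the input buffer from the network's declared input count rather than
// from the number of unique input IDs, so an unconnected input (like
// `inputs[5]` above) still counts toward the length.
let input_count = network.num_inputs();
let inputs = vec![0.0_f64; input_count];
```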
Agreed, thank you.
Hey @pengowen123, what is the status of the EANT2 merge?
I'm a little out of focus with respect to what remains to be completed. Is there anything I can help with? Can we use this as a tracking issue?
I'm super excited about using it & publishing pre-trained models on crates, apologies for hounding you about it! 😄
No worries! I've been working on getting a minimal, working version of the library put together. I initially thought it would simply be a matter of porting the crate to the new `cge` version, but it turned out to be far more complicated than that. Several major parts of the original code were simply wrong, and the original paper was ambiguous in some important areas as well, so I've had to do more research and redesign a lot. However, I have made steady progress and it seems to be working well now from my testing. I just have a bit more to finish up and I'll `git push` the first usable version soon.
Okay, all the changes should be pushed. I'll hold off on the actual crates.io release until things are more polished/tested, but feel free to test the master branch on any problems you can think of.
As for the overall state of the library, it should be pretty much fully functional. All the important parts are there, and from my testing everything is working well. The most immediate issues to fix are the lack of real tests and some code organization/quality issues. Then, some API changes are in order as I figure out what makes sense to be customizable and streamline common usage. Finally, some more complete examples that exercise more of the algorithm's power would be nice to have and provide additional data for choosing the default configuration. This is something I'd appreciate help with if you'd like to implement some new examples.
Incredible! Thank you so much for all your hard work. I will clone and begin working tonight and tomorrow.
Having trouble getting pendulum or mountain car to work at all.
Any luck or tips? Unclear if it's gym_rs or eant2. A lot of the notes in the implementations in gym are wrong, so I'm not sure.
Ok, so mountain car works fine (eventually); pendulum seems to be intractable, but I am a novice at actually using `eant2`.
I am now trying to train an MNIST classifier. It is very very very slow. Below is my flamegraph, sampled at 1khz over several minutes:
Almost 100% of the time for each run is in `CMAES<F>::run`, which curiously only spends about 30% of its time inside `sample_internal`, which (I think) is the part that actually computes the objective function. I would expect that to be the most expensive part (784-dim input, 10-dim output, over 128 samples, plus computing cross-entropy loss, etc.), and integral to the CMA-ES optimization loop. I've read the implementation of `run`, and I don't understand why 60%+ of the time is spent prior to reaching the evaluation stage. It is not a one-time startup cost; the call stack is self-similar on all levels, with the same pattern of time usage.
I have reduced the population size to 5 and am only doing one sample. Why would things still be so slow? Is there a nonlinear slowdown in parameter count that I don't understand? The linear algebra workload seems to be almost negligible (~3% of the total).
The code is a bit difficult to understand: very recursive, with lots of implementation indirection. Any clue what `run` is doing and how one might optimize it?
I've had the same issue with the pendulum problem, so I'll look into it to make sure it's not an issue with `eant2`.
`sample_internal` is indeed where the objective function is evaluated. You can try benchmarking your objective function for the MNIST classifier to get some measurements, but I wouldn't expect it to be more expensive than the linear algebra used in CMA-ES. Either way, I'm not entirely sure why your flamegraph looks like that, but you can try swapping out the parallel iterator in `Generation::update_generation` for a sequential one and making sure debug info is enabled to see if that makes it clearer.
As for why it's so slow, CMA-ES scales poorly with such large problem dimensions. Several NxN by Nx1 matrix multiplications happen per sampled point, and an NxN eigendecomposition happens every few iterations, where N is the number of dimensions/parameters being optimized. From previous benchmarks, these two operations seem to take up nearly all the algorithm-internal time of CMA-ES. The performance of the `cmaes` crate can likely be improved significantly here, but probably not by enough to tackle the MNIST problem as currently implemented. For a sense of scale, the initial networks generated by EANT2 for 784 inputs and 10 outputs will have an average of 3920 parameters for CMA-ES to optimize, while the largest problems I've seen tested with CMA-ES have about 1000 parameters.
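In general terms (standard CMA-ES cost accounting, not a measurement of this crate), the per-iteration work scales as

$$\underbrace{O(\lambda N^2)}_{\text{sampling}} \;+\; \underbrace{O(N^3)}_{\text{eigendecomposition, amortized over a few iterations}},$$

so going from roughly $N = 1000$ to $N = 3920$ multiplies the sampling term by about $15\times$ and the eigendecomposition term by about $60\times$, before accounting for the larger default population size $\lambda$.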
The best solution to this is probably to split the task up, giving feature recognition and other tasks with large numbers of inputs to a different algorithm, and only passing the data to EANT2 once it has been pre-processed to reduce the dimension count. This is the approach used in the visual servoing tasks described in the various papers on EANT2.
Very interesting information, thank you.
I agree that the linear algebra stuff ought to be the slowest part (4000x4000 matrix operations). I will investigate the profiling technique, as the flamegraphs I collect say otherwise.
I'd like to play around with an `arrayfire` implementation of the matrix operations, hopefully improving CMAES performance a lot.
I'm thinking many concurrent CMAES runs can all submit matrix multiplications / decompositions to the GPU and be served in a queue, with opportunistic concurrency on the GPU device/s.
Nice thing about arrayfire (don't trust me, I have fallen for their advertising) is it's sort of like writing CUDA code but it ought to be portable to Intel integrated graphics GPUs as well as Qualcomm Android GPUs, etc, plus we've got a reasonable rust API.
By the way, mountain car works for me. Pendulum still does not.
Hey, so I'm looking into CMAES internals and noticing a couple things:
Making `cmaes` generic over float precision should be fairly simple, and it would be a good feature to implement for the next release.
The main matrix operations are for point sampling, for updating the eigendecomposition, and for updating the covariance matrix. Point sampling involves NxN matrix multiplications, while updating the covariance matrix requires scalar multiplication of NxN matrices and a sum at the end. There are also scalar multiplications and a sum of Nx1 matrices for the mean update, but this should be comparatively minor.
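For reference, in the standard CMA-ES formulation (Hansen's notation; the crate may organize these steps differently internally), the operations above correspond roughly to

$$x_k = m + \sigma\, B D\, z_k, \qquad z_k \sim \mathcal{N}(0, I),$$

$$C \leftarrow (1 - c_1 - c_\mu)\, C + c_1\, p_c\, p_c^{\top} + c_\mu \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}\, y_{i:\lambda}^{\top}, \qquad m \leftarrow m + c_m\, \sigma \sum_{i=1}^{\mu} w_i\, y_{i:\lambda},$$

where $y_{i:\lambda} = (x_{i:\lambda} - m)/\sigma$ are the sorted steps and $p_c$ is the evolution path.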
The mountain car problem was not solvable for me until I adjusted the error calculation to be based on the car's position, as the metric provided by `gym_rs` seems to be all-or-nothing, which is difficult to optimize in general. But once the metric was changed, a solution could be found consistently within a single generation. Did you have to make similar adjustments when you tried it?
I didn't make any adjustments to the reward, but I did need to implement a maximum episode length, etc.
Struggling to make it any faster than it is in the benchmarks.
Question: why are we doing matrix multiplications at all? I see we generate a matrix of i.i.d normally distributed noise, then transform that noise using the covariance transform matrix.
Couldn't we just stretch and shift the individual noise components of each column (each parameter vector)?
As far as I know, that operation is not simplifiable. The function of the multiplications is to sample points from a multivariate normal distribution, and a quick search doesn't reveal any better algorithms for doing that.
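For context, the general fact behind this (ordinary multivariate normal sampling, nothing crate-specific): if the covariance factors as $C = A A^{\top}$ and $z \sim \mathcal{N}(0, I)$, then

$$x = m + \sigma A z \;\sim\; \mathcal{N}\!\left(m,\; \sigma^2 A A^{\top}\right) = \mathcal{N}(m, \sigma^2 C).$$

Stretching and shifting each component of $z$ independently corresponds to a diagonal $A$, which only reproduces the variances and drops the correlations between parameters that the covariance transform is there to encode.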
Also note that to perform operations on the GPU, the data must first be transferred to it, which probably adds considerable overhead. However, this overhead should become less significant relative to the operation speedup as the problem dimension increases, so you should compare benchmarks for N > ~100-300 to get an accurate idea of the benefits.
Your code for the mountain car problem works for me, but curiously it seems to produce decent solutions consistently in only a single generation. Further generations only improve the speed at which the goal is reached. So, I would consider this problem a success.
Also, I managed to get the pendulum problem working. Adjusting the fitness calculation makes a large difference here as well. I also turned off the `gene_deviations` calculations temporarily, as I thought it might lead to stagnation, but I have not validated this hypothesis yet. Here's the complete code for it:
It still takes quite some time to run (sometimes more than 50 generations), but it's fairly consistent at solving the problem:
Yes, I am sure the parallelism will help as problem size becomes large. I was getting really awful performance with my implementation (using the iGPU, unified memory); it was my first time trying something like this.
Are there any large benchmarks in the cmaes `benches`?

A `problem_dimension x population_size` (column x row) random matrix can be generated in parallel by arrayfire as a built-in (I think it happens on the GPU, unclear), and then I would like to perform many independent matrix-vector multiplications: each column of matrix `Z` (random data) independently matrix multiplied with all of matrix `T` (transform). I think you can actually do this with a single matrix multiply & some transposition, but I'm not sure how to get it to work.

Do you think it would be acceptable to replace all of the linear algebra representations in the crate with arrayfire stuff, assuming we can actually get a massive speedup in the GPU case? Arrayfire falls back to CPU implementations, supports OpenCL and CUDA, and has really broad compatibility. However, it requires that ArrayFire binaries be installed on the machine to function at all.
Ok, I think I have realized that no transpose is required and you can just do a single matrix multiplication. Does this look correct?
transformed
column vectors = covariance transform * random matrix
┌0┬─┬─┬─┬─┬─┬─┬─┐ 0─1─2─3─┬─┬─┬─┬─┬─┬─┬─┐ ┌0┬─┬─┬─┬─┬─┬─┬─┐
├1┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├1┼─┼─┼─┼─┼─┼─┼─┤
├2┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├2┼─┼─┼─┼─┼─┼─┼─┤
├3┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├3┼─┼─┼─┼─┼─┼─┼─┤
├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
▲ ├─┼─┼─┼─┼─┼─┼─┼─┤ ▲ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
│ ├─┼─┼─┼─┼─┼─┼─┼─┤ │ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
│ └─┴─┴─┴─┴─┴─┴─┴─┘ │ └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘ ▲ ├─┼─┼─┼─┼─┼─┼─┼─┤
BxP────▶ BxD────▶ │ ├─┼─┼─┼─┼─┼─┼─┼─┤
│ └─┴─┴─┴─┴─┴─┴─┴─┘
DxP────▶
The random matrix contains column vectors (individuals, each with D components), which are multiplied by the rows of the covariance transform matrix and end up occupying the corresponding column in the result, I think.

Do I have the dimensions right in the figure? That we end up with a BxP matrix? I am going off the fact that the `cov_transform` method claims it provides a `B x D` matrix, and I know I am supplying a `DxP` matrix (P is population, D is dimension, and I don't know what B is!).

This is what I have so far, although I don't think both it and the picture can be correct, as we ought to end up with vectors of length `D`, no?
```rust
/// Shared logic between `sample` and `sample_parallel`
fn sample_internal<
P: Fn(Vec<DVector<f64>>, &mut F) -> Result<Vec<EvaluatedPoint>, InvalidFunctionValueError>,
>(
&mut self,
state: &State,
mode: Mode,
_parallel_update: bool,
evaluate_points: P,
) -> Result<Vec<EvaluatedPoint>, InvalidFunctionValueError> {
//
// I found it very confusing at first, but have realized you can in fact accomplish
// the entire task of many independent column-vector-by-matrix multiplications in a
// single matrix multiplication:
//
// transformed
// column vectors = covariance transform * random matrix
// ┌0┬─┬─┬─┬─┬─┬─┬─┐ 0─1─2─3─┬─┬─┬─┬─┬─┬─┬─┐ ┌0┬─┬─┬─┬─┬─┬─┬─┐
// ├1┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├1┼─┼─┼─┼─┼─┼─┼─┤
// ├2┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├2┼─┼─┼─┼─┼─┼─┼─┤
// ├3┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├3┼─┼─┼─┼─┼─┼─┼─┤
// ├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// ├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// ├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// ▲ ├─┼─┼─┼─┼─┼─┼─┼─┤ ▲ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// │ ├─┼─┼─┼─┼─┼─┼─┼─┤ │ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// │ └─┴─┴─┴─┴─┴─┴─┴─┘ │ └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘ ▲ ├─┼─┼─┼─┼─┼─┼─┼─┤
// BxP────▶ BxD────▶ │ ├─┼─┼─┼─┼─┼─┼─┼─┤
// │ └─┴─┴─┴─┴─┴─┴─┴─┘
// DxP────▶
//
//
// i.i.d normally-distributed matrix
let random_matrix = randn::<f32>(dim4!(self.dim as u64, self.population_size as u64));
// covariance transform matrix
let buffer = {
let ct = state.cov_transform();
let mut buff = F32_BUFFER_POOL.pull(|| Vec::with_capacity(ct.len()));
if buff.len() < ct.len() { buff.resize(ct.len(), 0.0); }
// copy into buffer (nalgebra promises it is column-major, so this is safe)
ct.as_slice().iter()
.zip(buff.iter_mut())
.for_each(|(source, target)| *target = *source as f32);
buff
};
let covariance_transform = Array::new(
&buffer[..],
dim4!(state.cov_transform().ncols() as u64, state.cov_transform().nrows() as u64)
);
// matrix, the columns of which are our transformed vectors.
let transformed_vectors = matmul(&covariance_transform, &random_matrix, MatProp::NONE, MatProp::NONE);
let tv_size = transformed_vectors.dims().elements() as usize;
let points = {
// copy the matrix back to host device
let mut local_buffer = F32_BUFFER_POOL.pull(|| Vec::with_capacity(tv_size));
if local_buffer.len() < tv_size { local_buffer.resize(tv_size, 0.0); }
transformed_vectors.host(&mut local_buffer[..]);
let mut local_buffer_f64 = F64_BUFFER_POOL.pull(|| Vec::with_capacity(tv_size));
local_buffer_f64.clear();
for x in local_buffer.iter() { local_buffer_f64.push(*x as f64); }
// convert the matrix to a list of vectors
let rows = transformed_vectors.dims().get()[1] as usize;
let mut vector_list = Vec::with_capacity(self.population_size);
for chunk in local_buffer_f64.chunks(rows) {
vector_list.push(DVector::from_column_slice(chunk));
}
vector_list
};
// Evaluate and rank points
let mut points = evaluate_points(points, &mut self.objective_function)?;
self.function_evals += points.len();
points.sort_by(|a, b| mode.sort_cmp(a.value, b.value));
Ok(points)
}
```
(All of the redundant copying / precision conversion stuff would eventually go away)
> Are there any large benchmarks in the cmaes `benches`?
There are not, but you can just add some with larger dimensions. I would recommend adding at the minimum an `N=300, lambda=512*default` benchmark in order to properly test the sampling efficiency.
> Do you think it would be acceptable to replace all of the linear algebra representations in the crate with arrayfire stuff, assuming we can actually get a massive speedup in the GPU case? Arrayfire falls back to CPU implementations, supports OpenCL and CUDA, and has really broad compatibility. However, it requires that ArrayFire binaries be installed on the machine to function at all.
For maximum efficiency this is probably necessary, but making it configurable through a feature somehow would be nice for compatibility (`cmaes` by default is pure Rust with no external dependencies).
> Ok, I think I have realized that no transpose is required and you can just do a single matrix multiplication. Does this look correct?
I see now, this does work with a single matrix multiplication. It seems you have the dimensions mixed up though; the covariance matrix is `N x N` where `N` is the problem dimension, and `B` and `D` simply refer to the eigenvector and eigenvalue matrices, respectively. But setting up an `N x lambda` random matrix (where `lambda` is the population size) and transforming it with `cov_transform` should be valid, as it produces another `N x lambda` matrix with properly transformed columns.
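In the standard notation (again, a general CMA-ES fact rather than anything crate-specific), that eigendecomposition is

$$C = B D^2 B^{\top},$$

with $B$ holding the eigenvectors of the $N \times N$ covariance matrix $C$ and $D$ the diagonal matrix of square-rooted eigenvalues, so the sampling transform $B D$ is itself $N \times N$.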
If you want to test your changes, `cargo test` will run a fixed seed test to make sure the results haven't changed. However, this requires that the random numbers are generated in the same way. You can just temporarily change the random number generation to match the old version when running the tests, then switch back once you're sure everything is correct.
EDIT: There is one more major matrix multiplication happening here. It's an `N x N` by `N x 1` and an `N x 1` by `1 x N` multiplication for each sampled point. This one doesn't seem as easy to simplify, but the whole step can probably be done on the GPU.
Okay, it will probably be possible to make everything feature-gated, but I think the best path to this will involve making the public types backend-agnostic, e.g. passing the sample point as `&[fxx]` rather than a `DVector`.
Will correct the issue with dimensions & add a new benchmark to validate the speedup.
Everything is working correctly, except for the `restart::tests::test_fixed_seed` test, which disagrees with expectation on the number of function evals:
```
thread 'restart::tests::test_fixed_seed' panicked at 'assertion failed: `(left == right)`
  left: `5790`,
 right: `5778`', src/restart/mod.rs:708:9
```
```rust
#[test]
fn test_fixed_seed() {
    let function = |x: &DVector<f64>| 1e-8 + (x[0] - 2.0).powi(2) + (x[1] - 1.0).powi(2);
    let strategy = RestartStrategy::Local(Local::new(10, None).unwrap());
    let seed = 96674803299116567;
    let results = RestartOptions::new(2, -1.0..=1.0, strategy)
        .seed(seed)
        .build()
        .unwrap()
        .run(|| function);

    assert_eq!(10, results.runs);
    assert_eq!(5790, results.function_evals); // <== trouble!
    assert_eq!(RestartTerminationReason::MaxRuns, results.reason);

    let best = results.best.unwrap();
    let eps = 1e-12;
    assert_approx_eq!(1.0000000002890303e-8, best.value, eps);
    assert_approx_eq!(2.00000000075140, best.point[0], eps);
    assert_approx_eq!(0.9999999989799438, best.point[1], eps);
}
```
IDK why this is happening, I have some checks in place:
```rust
// ...
    // convert the matrix to a list of vectors
    let mut vector_list = Vec::with_capacity(self.population_size);
    for chunk in local_buffer_f64.chunks_exact(self.dim) {
        vector_list.push(DVector::from_column_slice(chunk));
    }

    // ensure the number of vectors (`points.len()`) is correct, and that the buffer size
    // is what we expect.
    debug_assert_eq!(local_buffer_f64.len(), self.dim * self.population_size);
    debug_assert_eq!(vector_list.len(), self.population_size);

    vector_list
};

// Evaluate and rank points
let mut points = evaluate_points(points, &mut self.objective_function)?;
self.function_evals += points.len();
```
Is the hard-coded value expected by the test actually correct? Why would they differ otherwise? The function eval count I get is deterministic, but disagrees with the expected value.
Also, just so I'm clear, the eigendecomposition of a matrix is a process: `M => { eigenvalues, eigenvectors }`?
I'm fairly sure that `arrayfire` is not actually using the iGPU, and I have not found a way to convince it to do what I want.
The assertion is effectively checking that the same number of iterations is taken to terminate, not that the number of function evals per iteration is the same. So, if any change to the algorithm is made (even a small one), that change will cause different behavior each iteration that eventually adds up to a different number of iterations and therefore function evaluations before termination. If you were to comment that assertion out, the other ones should fail as well because everything is slightly off.
This fixed seed test is a bit more sensitive than the one in `tests/general.rs` due to running for more iterations. Does the `general.rs` one pass? If it does, your changes are most likely correct and are only causing issues with this test due to the sensitivity and minor precision differences.
Yes, you are correct about the eigendecomposition.
So a rough idea of the `sample_internal` speedup is now available.
For all tests: `nalgebra` is CPU, single thread; `arrayfire` is GPU, with the overhead of copying and floating point conversions.

- `nalgebra`: ~70us per run / `arrayfire`: ~800us per run
- `nalgebra`: ~1.6ms per run / `arrayfire`: ~1.8ms per run
- `nalgebra`: ~50ms per run / `arrayfire`: ~13ms per run
- `nalgebra`: 140.991s per run / `arrayfire`: 3.434s per run

These numbers look impressive! Are they the overall per-iteration time, or are you only measuring `Sampler::sample_internal`? Either way, the performance seems far better than the CPU version already, and I imagine it will improve further when all operations are converted.
They are inside one run of `sample_internal`, not the whole CMA-ES iteration, which is much longer than this step.
Next step is translating everything to arrayfire. This is actually very hard. A lot of the things built into nalgebra are not available in arrayfire.
Also, I'm pretty sure arrayfire is underperforming right now. I don't know why; my usage is pretty obvious. The reason I believe this is that gpu.rocks/#benchmark gets remarkably better performance for a 4096² matrix multiply using web tech (100ms). I bet the performance relies in part on crazy stuff the Chrome team has already done. I think my 4096² time is 470ms; I need to check again.
I have zero clue how well the current implementation would perform on something with more throughput, like a discrete GPU. I have an old 1050Ti, I'll check if CUDA works any better.
That's strange indeed. I'm not sure what could be causing that, but you could check the ArrayFire issue tracker to see if it's affecting anyone else at least.
The operations used in `cmaes` should be available in ArrayFire, even if they have to be done in a slightly different way. Was there anything in particular you needed that was missing?
I have zero clue why perf is worse than GPU.js; I think it might be an apples-to-oranges comparison. I have a MacBook, and Apple is very shamefully shunning OpenCL in favor of Metal. OpenCL is not getting patches or updates at all, and Metal perf can be 40% better. It could be that the Chromium team has a layer between web shaders (that GPU.js makes) and Metal. It could be an issue in arrayfire, idk. Haven't seen anything about it online.
Main gripe so far is that everything is very high level and there is no escape hatch other than to write an OpenCL + Cuda "shader" by hand if what you need is missing.
High level isn't the problem; the problem is that I'm not just doing linear algebra & random noise. I have 3-dimensional tensors that I need to perform operations on along a specific axis, etc. A lot of what I need can be accomplished in an unintuitive way, but I wish they'd provide a nice clear syntax for declaring tensor operations unambiguously & flexibly.
I just need to dive into every part of CMAES linear algebra usage & see what it's actually trying to accomplish, and find how to translate that & skip work we don't need to be doing. For instance, there's some parts that tamper with the internals of the covariance matrix, trying to make it symmetric about the diagonal IIRC. This may not be necessary anymore, as you can communicate matrix properties to arrayfire & it may be able to take a lower triangular matrix & just read upper element values from lower element memory etc. Stuff like that. I think the more context given, the more it can optimize, as it ships with a JIT "kernel fusion" thing that basically builds a queue of operations & then only executes them at the last possible second with maximum context, allowing it to dynamically merge kernels together and automatically reuse memory in a mostly optimal way.
Right now I'm working on transforming the `rank_mu_update` code & testing a hunch I have that can reduce the complexity a lot. Essentially, I think I can take all the `individuals.unscaled_step()` vectors together as a matrix, weight by a vector down the columns, sum element-wise across the rows, yielding a single vector, then take the matrix multiplication with its transpose to yield a matrix. Although, this may not actually be equivalent, and it may be necessary to assemble many `v x vT` matrices (yielding a very big 3D tensor), weight them all independently, then sum through the individuals axis. This would be a shame, as it's vastly more FLOPs, and the GPU parallelism runs out eventually & you must contend with the actual scaling complexity of the operation.
Regardless, it's not hard to parallelize the rank_mu step, even if it turns out it's not possible to simplify it.
Every single thing requires tons of copying though, and I think we should consider an implementation in which almost all data remains on the GPU, and modifications are informed by smaller pieces of data (objective function samples) that are sent back to the GPU, guiding the search.
Arrayfire has some interesting examples that got me thinking about this, specifically they have a genetic search algorithm that essentially lives entirely on the GPU. You can do sorting, selection, mutation, crossover, etc. Don't think we should take it that far, but if CMAES could be made almost fully GPU-native I'm sure it would be worth it.
Note: whenever I say GPU, substitute "compute device", arrayfire seamlessly transitions between CPU, GPU, and "accelerator hardware" (like, FPGA? Cool!) depending on availability.
I believe it is necessary to perform the rank-mu update as-is. If each vector is weighted and summed before the transpose and multiplication, the weight of the resulting matrix will effectively be the weight squared, as you would be multiplying `(weight * step) * (weight * step).transpose()` for each vector. This is avoided in the current code by weighting only the original vector and not the transpose, but I'm not sure if you can replicate this when summing first (using `weight.sqrt()` doesn't work here because some weights are negative).
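For what it's worth, the weighted sum of outer products in that update can also be written as a single product of ordinary 2-D matrices (a general linear algebra identity, not something specific to this crate), which sidesteps both the squared-weight problem and the 3D tensor:

$$\sum_i w_i\, y_i\, y_i^{\top} \;=\; Y\, \operatorname{diag}(w)\, Y^{\top}, \qquad Y = \begin{bmatrix} y_1 & \cdots & y_\lambda \end{bmatrix}.$$

Each weight appears exactly once, so negative weights are handled, and the cost is a diagonal scaling of an $N \times \lambda$ matrix followed by one $N \times \lambda$ by $\lambda \times N$ multiplication.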
I think it's worth considering a mostly GPU version of `cmaes`, even if it requires writing custom shaders. The overall flow of CMA-ES is:
So, one potential design that maximizes GPU usage would be:
This way there are only two transfers of an `N x lambda` matrix and one transfer of the state per iteration, and all linear algebra is done purely on the GPU (or other compute device if ArrayFire is used). This would be a lot more work than what you're doing now though, and I'm not sure how best to incorporate this kind of design into `cmaes`, so it can just be left as a future plan for now.
So I think I have everything worked out in terms of the rank_mu update, from a performance perspective. Tests pass, but I'm not sure if that guarantees all the numbers are correct inside.
The only expensive operations remaining are the buffer copies, which occupy ~97% of execution time (usually about 50ms for each buffer!), and will go away if we switch to storing data on the GPU.
I estimate an `N=128, lambda=128` single pass will end up taking something like 3.39 milliseconds once the buffer overhead is gone. I'm pretty sure there are better ways to write / structure it that would improve performance a lot more.
The eigendecomposition work is all still nalgebra / CPU.
There are some issues I've encountered though, which pertain to GPU usage.
The GPU has finite memory! I cannot run the `N=4096, lambda=4096` benchmark; I have only 1500 MB of video RAM, and my implementation accesses maximum parallelism at the expense of video memory usage. The memory usage explodes because I am using 3D tensors in places, e.g. `4 bytes * 4096 * 4096 * 14,000 = 939 GB`.
This means we need to actively break down the process into a sequence of maximum manageable chunks, at least for the steps that use 3D tensors that can grow insanely large.
Even with that issue resolved, as problem dimension grows, the memory requirements grow at least as large as max(D^2, lambda * D) because of the matrices used. If things could be refactored / reformalized somehow to avoid this, that would be ideal.
For the rank_mu portion, the matrices we create are always going to be symmetric across the diagonal, so in principle only the lower portion needs to be constructed. I do not know how to specify this through arrayfire.
I don't want to end up creating a system that works for N=4096 / N=16,000 problems but locks people out prematurely (they'll need an extremely expensive GPU with lots of memory). The fundamental space complexity requirements remain `O(n^2)`, unfortunately.
> Tests pass, but I'm not sure if that guarantees all the numbers are correct inside.
If the fixed seed tests are passing, your changes almost certainly have not affected the algorithm's behavior aside from negligible rounding/precision differences.
> The only expensive operations remaining are the buffer copies, which occupy ~97% of execution time (usually about 50ms for each buffer!), and will go away if we switch to storing data on the GPU.
The `N * lambda` population matrix will need to be copied at least once to the CPU in order to pass the vectors to the objective function, and probably back to the GPU as well after sorting. So, there will still be some overhead to all this, but it should become less significant as `N` increases (and it will depend on your GPU).
> The GPU has finite memory! I cannot run the `N=4096, lambda=4096` benchmark; I have only 1500 MB of video RAM, and my implementation accesses maximum parallelism at the expense of video memory usage. The memory usage explodes because I am using 3D tensors in places, e.g. `4 bytes * 4096 * 4096 * 14,000 = 939 GB`.
I wouldn't expect memory to be an issue at this scale, as `4 bytes * 4096 * 4096` is only ~67 MB. What is the `14,000` for in your example? If it does turn out to be an issue though, the operation can freely be split up into any number of `N * L` matrices, where all `L` values sum to `lambda`.
> If the fixed seed tests are passing, your changes almost certainly have not affected the algorithm's behavior aside from negligible rounding/precision differences.
I think this is the case.
> The `N * lambda` population matrix will need to be copied at least once to the CPU in order to pass the vectors to the objective function, and probably back to the GPU as well after sorting. So, there will still be some overhead to all this, but it should become less significant as `N` increases (and it will depend on your GPU).
Yes, the `N * lambda` copy is required, but I think only a `lambda`-length vector needs to be sent back to the GPU, which can function as a sorting guide for the columns of the matrix. The GPU can sort the matrix itself.
> I wouldn't expect memory to be an issue at this scale, as `4 bytes * 4096 * 4096` is only ~67 MB. What is the 14,000 for in your example? If it does turn out to be an issue though, the operation can freely be split up into any number of `N * L` matrices, where all `L` values sum to `lambda`.
This is the plan, although finding the maximum manageable stack size is not really possible, as I can't get information about the device memory size; the C/C++ API provides this, but the Rust bindings seem to have forgotten it. The 14,000 comes from `N=4096, lambda=4096` default: you have 4096 default matrices, each 4096x4096, each float 4 bytes, and it gets big in the rank_mu step. Previously, this is what I was testing for performance, as in the earlier step (transform matrix etc.) you only get an `N * lambda` matrix, which is ~67 MB as you say.
Oh, I see now. The two main options I can think of are either:
I think the second option is better in most cases, and switching to the first shouldn't be too difficult if this doesn't work out.
Yeah, I'd like to use a hybrid approach wherein we still leverage matrix-level parallelism AND many-matrix parallelism, but never overflow memory by chunking up the work.
However, this is really hard to do efficiently in a multi-thread environment in which many CMAES optimization passes are running concurrently.
I think for now (and without ability to inspect device memory size!) I'll leave it as-is.
Turns out there were a few bugs in my original implementation that are resolved now.
Will try to tackle eigendecomposition next. Arrayfire provides a QR decomposition algorithm built in, and I understand how to use that to get eigenvalues and eigenvectors (I think).
What I don't know exactly how to do is assess convergence of the decomposition. A lot of the code I've read just blindly iterates some number of times, whereas the original `nalgebra` function dynamically examines convergence & compares it to a tolerance parameter. It's unclear exactly how this is being done.
There are also many different flavors of eigendecomposition, and I'm not sure what properties QR has / lacks.
Any input on this?
Here is a visualization I stumbled on today:
https://user-images.githubusercontent.com/2095902/174763068-4a881f06-b1a8-42f3-ae08-b7050a09f191.mp4
The left / right matrices are random and symmetric respectively, I think.
When we talk about decomposition convergence, does this refer to the magnitude of the non-diagonal components? It's difficult to find straight answers about this on the web.
Also, I'm thinking it may make sense to provide mutually exclusive implementations for subcomponents of CMAES that are then conditionally compiled by feature flags. The changes required for a good version of the arrayfire implementation are far-reaching, e.g. changes to lots of different structs to store a pointer to GPU memory. Your existing implementation will remain the default (if you want) and be guarded by `#[cfg(not(feature = "arrayfire"))]`, and the arrayfire implementation will exist behind `#[cfg(feature = "arrayfire")]`. We can then mirror this flag setup in `eant2`.

In order to hide the implementation completely, the CMAES public API should no longer make use of `nalgebra` constructs like `DVector`, hopefully opting for `&[f32]` or `&[f64]`, or even `Reusable<Vec<f32>>`, etc.
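A minimal sketch of what that gating might look like (the feature and module names here are hypothetical, not the actual crate layout):

```rust
// Cargo.toml (sketch):
//
// [features]
// default = []
// arrayfire = ["dep:arrayfire"]

// In lib.rs (sketch): select a linear-algebra backend at compile time and keep
// the rest of the crate coded against the `backend` module's interface.
#[cfg(not(feature = "arrayfire"))]
mod backend {
    // Pure-Rust, nalgebra-based implementation (the current default behavior).
}

#[cfg(feature = "arrayfire")]
mod backend {
    // ArrayFire-based implementation; requires the ArrayFire binaries at runtime.
}
```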
Another option, hypothetically, is to dynamically choose between a backend based on problem size. The nalgebra implementation is a lot faster than arrayfire for small sizes. The optimal threshold to switch to arrayfire will depend massively on the hardware: CPU, GPU(s), cache sizes + memory hierarchy, memory speed, OpenCL version, data type, etc etc etc.
I was thinking (about something I don't want to build IRL) of a persistent store that assumes a partition on problem sizes: a single critical point beyond which arrayfire is always better for performance. It would measure this point at first execution, so we know the optimal switch point, tuned for the hardware, for all future runs.
I'm not familiar with QR decompositions unfortunately. From what I've read, it seems that convergence has something to do with the magnitudes of non-diagonal elements, but I'm not sure exactly what the relationship is.
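From what I gathered, the basic textbook scheme looks something like the sketch below (unshifted QR iteration on a symmetric matrix, with convergence judged by the off-diagonal magnitude). This is only my reading of it, not what `nalgebra` does verbatim; real implementations add shifts and a tridiagonal reduction first:

```rust
use nalgebra::DMatrix;

/// Textbook (unshifted) QR iteration for a symmetric matrix: a sketch only.
/// Returns (matrix with eigenvalues on its diagonal, accumulated eigenvectors
/// as columns).
fn qr_eigen_sketch(mut a: DMatrix<f64>, tol: f64, max_iters: usize) -> (DMatrix<f64>, DMatrix<f64>) {
    let n = a.nrows();
    let mut vectors = DMatrix::<f64>::identity(n, n);
    for _ in 0..max_iters {
        let qr = a.clone().qr();
        let (q, r) = (qr.q(), qr.r());
        a = &r * &q;              // similarity transform: eigenvalues are preserved
        vectors = &vectors * &q;  // accumulate the orthogonal factors
        // "Convergence" here means the off-diagonal mass shrinking toward zero.
        let off_diag_sq: f64 = (0..n)
            .flat_map(|i| (0..n).map(move |j| (i, j)))
            .filter(|&(i, j)| i != j)
            .map(|(i, j)| a[(i, j)] * a[(i, j)])
            .sum();
        if off_diag_sq.sqrt() < tol {
            break;
        }
    }
    (a, vectors)
}
```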
At some point this will probably be the best solution, as very little would actually be shared between the existing code and the ArrayFire code. Automatically determining which backend to use is probably outside the scope of the crate, but the provided benchmarks should be enough to help users determine which to use for their particular problem.
Hello!
I'm very excited about using this library; however, the README claims it is in a broken state, waiting for fixes in the CMA-ES repo. I see the CMA-ES repo is more active, with many commits recently.
Is the README still accurate in saying this crate is broken?
Amazing work, thank you so much!