This is strange indeed. What is the panic message for each one? I'll try to reproduce the issue locally in the meantime.
Alright, I tracked down the issue to the `identify_recurrence` function. It's calling `enumerate` before deduplication, resulting in invalid indices into `self.persistence`. All this work is already being done by `cge` internally though, so I'll just expose it through a `Network` method for you so you don't have to duplicate any logic.
`cge` 0.1.1 is released and has a `Network::recurrent_state_ids` method to replace the logic in `identify_recurrence`. The first neuron ID in the list should correspond to `self.persistence[0]`, the second to `self.persistence[1]`, and so on, which you can still build a `HashMap` for.
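A minimal sketch of building that map, assuming `recurrent_state_ids` yields the neuron IDs in that order (the exact ID and return types are elided here):

```rust
use std::collections::HashMap;

// Map each recurrent neuron ID to its index in `self.persistence`,
// relying on `recurrent_state_ids` returning the IDs in matching order.
let recurrent_index: HashMap<_, usize> = network
    .recurrent_state_ids()
    .into_iter()
    .enumerate()
    .map(|(index, id)| (id, index))
    .collect();
```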
It seems that during the refactor, repeated `pop` calls were replaced by `drain` when popping items off the stack, which reversed the order in which they are summed. If you want to match `cge` exactly, you can just reverse the order of your sum statements as well.
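A standalone illustration of that order difference (not the actual cge code):

```rust
fn main() {
    // Repeated `pop` walks the Vec from the back...
    let mut stack = vec![1.0_f64, 2.0, 4.0];
    let mut popped = Vec::new();
    while let Some(x) = stack.pop() {
        popped.push(x);
    }
    assert_eq!(popped, vec![4.0, 2.0, 1.0]);

    // ...while `drain(..)` walks it from the front, so a refactor from one to
    // the other reverses the order in which floating-point terms are accumulated.
    let mut stack = vec![1.0_f64, 2.0, 4.0];
    let drained: Vec<f64> = stack.drain(..).collect();
    assert_eq!(drained, vec![1.0, 2.0, 4.0]);
}
```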
Oh wow, yes! Thank you
That solved part of the problem, but the `with_extra_data_v1.cge` network still gives trouble.
The reason for this is that the default backend used by `const_cge` is `libm`, and the `tanh` implementation is not bit-identical to the `std` version, which `cge` uses.
I consider this resolved.
Hello, while testing with the randomly generated networks, I get an issue with my input sizing. CGE analysis says the length needs to be 9, but index 9 is accessed, resulting in a panic.
```rust
let input_count = {
    // The number of inputs to a network is the number of unique input IDs
    // found among all Input genes in the genome.
    let mut input_ids = HashSet::new();
    network
        .genome()
        .iter()
        .filter_map(|g| match g {
            Gene::Input(i) => Some(i.id()),
            _ => None,
        })
        .for_each(|id| { input_ids.insert(id.as_usize()); });
    input_ids.len()
};
```
As you can see, `inputs[9]` is accessed, which is incorrect.
As suspected, if you grep for `inputs[`, isolate the access, and sort:
There is a conspicuous lack of access to `inputs[5]`. The input IDs are not compact. Is this possible behavior? Have I messed something up in codegen? Is the CGE file invalid?
Essentially, are the input IDs supposed to be compact? Is there any harm in me compactifying them? Does this mean input number 5 is not connected to the interior of the network, and I should respect the original IDs, taking the highest ID as the length, rather than the number of unique IDs?
Thank you
The input IDs are not compact, so the network is supposed to have 10 inputs. If you want to guarantee that `const_cge` matches `cge` behavior, you can just get the number of inputs to the network by calling `Network::num_inputs`.
The reason for this is that you might have a network that logically has some number of inputs, but just lacks connections to one of them. In this case, compacting the input IDs would be possible, but it would change the behavior of the network and so isn't currently done.
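A minimal sketch of the corrected sizing using that method (assuming it returns the count directly; the exact signature may differ):

```rust
// Size the input buffer from the network's declared input count rather than
// from the number of unique input IDs, so an unconnected input (like
// `inputs[5]` above) still counts toward the length.
let input_count = network.num_inputs();
let inputs = vec![0.0_f64; input_count];
```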
Agreed, thank you.
Hey @pengowen123, what is the status of the EANT2 merge?
I'm a little out of focus with respect to what remains to be completed. Is there anything I can help with? Can we use this as a tracking issue?
I'm super excited about using it & publishing pre-trained models on crates, apologies for hounding you about it! 😄
No worries! I've been working on getting a minimal, working version of the library put together. I initially thought it would simply be a matter of porting the crate to the new `cge` version, but it turned out to be far more complicated than that. Several major parts of the original code were simply wrong, and the original paper was ambiguous in some important areas as well, so I've had to do more research and redesign a lot. However, I have made steady progress and it seems to be working well now from my testing. I just have a bit more to finish up and I'll `git push` the first usable version soon.
Okay, all the changes should be pushed. I'll hold off on the actual crates.io release until things are more polished/tested, but feel free to test the master branch on any problems you can think of.
As for the overall state of the library, it should be pretty much fully functional. All the important parts are there, and from my testing everything is working well. The most immediate issues to fix are the lack of real tests and some code organization/quality issues. Then, some API changes are in order as I figure out what makes sense to be customizable and streamline common usage. Finally, some more complete examples that exercise more of the algorithm's power would be nice to have and provide additional data for choosing the default configuration. This is something I'd appreciate help with if you'd like to implement some new examples.
Incredible! Thank you so much for all your hard work. I will clone and begin working tonight and tomorrow.
Having trouble getting pendulum or mountain car to work at all.
Any luck or tips? Unclear if it's gym_rs or eant2. A lot of the notes in the implementations in gym are wrong, so I'm not sure.
Ok, so mountain car works fine (eventually); pendulum seems to be intractable, but I am a novice at actually using `eant2`.
I am now trying to train an MNIST classifier. It is very very very slow. Below is my flamegraph, sampled at 1khz over several minutes:
Almost 100% of the time for each run is in `CMAES<F>::run`, which curiously only spends about 30% of its time inside `sample_internal`, which (I think) is the part that actually computes the objective function. I would expect that to be the most expensive part (784-dim input, 10-dim output, over 128 samples, plus computing cross-entropy loss, etc.), and integral to the CMA-ES optimization loop. I've read the implementation of `run`, and I don't understand why 60%+ of the time is spent prior to reaching the evaluation stage. It is not a one-time startup cost; the call stack is self-similar on all levels, with the same pattern of time usage.
I have reduced the population size to 5 and am only doing one sample. Why would things still be so slow? Is there a nonlinear slowdown in parameter count that I don't understand? The linear algebra workload seems to be almost negligible (~3% of the total).
The code is a bit difficult to understand: very recursive, with lots of implementation indirection. Any clue what `run` is doing and how one might optimize it?
I've had the same issue with the pendulum problem, so I'll look into it to make sure it's not an issue with `eant2`.
`sample_internal` is indeed where the objective function is evaluated. You can try benchmarking your objective function for the MNIST classifier to get some measurements, but I wouldn't expect it to be more expensive than the linear algebra used in CMA-ES. Either way, I'm not entirely sure why your flamegraph looks like that, but you can try swapping out the parallel iterator in `Generation::update_generation` for a sequential one and making sure debug info is enabled to see if that makes it clearer.
As for why it's so slow, CMA-ES scales poorly with such large problem dimensions. Several NxN by Nx1 matrix multiplications happen per sampled point, and an NxN eigendecomposition happens every few iterations, where N is the number of dimensions/parameters being optimized. From previous benchmarks, these two operations seem to take up nearly all the algorithm-internal time of CMA-ES. The performance of the `cmaes` crate can likely be improved significantly here, but probably not by enough to tackle the MNIST problem as currently implemented. For a sense of scale, the initial networks generated by EANT2 for 784 inputs and 10 outputs will have an average of 3920 parameters for CMA-ES to optimize, while the largest problems I've seen tested with CMA-ES have about 1000 parameters.
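In general terms (standard CMA-ES cost accounting, not a measurement of this crate), the per-iteration work scales as

$$\underbrace{O(\lambda N^2)}_{\text{sampling}} \;+\; \underbrace{O(N^3)}_{\text{eigendecomposition, amortized over a few iterations}},$$

so going from roughly $N = 1000$ to $N = 3920$ multiplies the sampling term by about $15\times$ and the eigendecomposition term by about $60\times$, before accounting for the larger default population size $\lambda$.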
The best solution to this is probably to split the task up, giving feature recognition and other tasks with large numbers of inputs to a different algorithm, and only passing the data to EANT2 once it has been pre-processed to reduce the dimension count. This is the approach used in the visual servoing tasks described in the various papers on EANT2.
Very interesting information, thank you.
I agree that the linear algebra stuff ought to be the slowest part (4000x4000 matrix operations). I will investigate the profiling technique, as the flamegraphs I collect say otherwise.
I'd like to play around with an `arrayfire` implementation of the matrix operations, hopefully improving CMAES performance a lot.
I'm thinking many concurrent CMAES runs can all submit matrix multiplications / decompositions to the GPU and be served in a queue, with opportunistic concurrency on the GPU device/s.
Nice thing about arrayfire (don't trust me, I have fallen for their advertising) is it's sort of like writing CUDA code but it ought to be portable to Intel integrated graphics GPUs as well as Qualcomm Android GPUs, etc, plus we've got a reasonable rust API.
By the way, mountain car works for me. Pendulum still does not.
Hey, so I'm looking into CMAES internals and noticing a couple things:
Making `cmaes` generic over float precision should be fairly simple, and it would be a good feature to implement for the next release.
The main matrix operations are for point sampling, for updating the eigendecomposition, and for updating the covariance matrix. Point sampling involves NxN matrix multiplications, while updating the covariance matrix requires scalar multiplication of NxN matrices and a sum at the end. There are also scalar multiplications and a sum of Nx1 matrices for the mean update, but this should be comparatively minor.
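For reference, in the standard CMA-ES formulation (Hansen's notation; the crate may organize these steps differently internally), the operations above correspond roughly to

$$x_k = m + \sigma\, B D\, z_k, \qquad z_k \sim \mathcal{N}(0, I),$$

$$C \leftarrow (1 - c_1 - c_\mu)\, C + c_1\, p_c\, p_c^{\top} + c_\mu \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}\, y_{i:\lambda}^{\top}, \qquad m \leftarrow m + c_m\, \sigma \sum_{i=1}^{\mu} w_i\, y_{i:\lambda},$$

where $y_{i:\lambda} = (x_{i:\lambda} - m)/\sigma$ are the sorted steps and $p_c$ is the evolution path.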
The mountain car problem was not solvable for me until I adjusted the error calculation to be based on the car's position, as the metric provided by `gym_rs` seems to be all-or-nothing, which is difficult to optimize in general. But once the metric was changed, a solution could be found consistently within a single generation. Did you have to make similar adjustments when you tried it?
I didn't make any adjustments to the reward, but I did need to implement a maximum episode length, etc.
Struggling to make it any faster than it is in the benchmarks.
Question: why are we doing matrix multiplications at all? I see we generate a matrix of i.i.d normally distributed noise, then transform that noise using the covariance transform matrix.
Couldn't we just stretch and shift the individual noise components of each column (each parameter vector)?
As far as I know, that operation is not simplifiable. The function of the multiplications is to sample points from a multivariate normal distribution, and a quick search doesn't reveal any better algorithms for doing that.
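For context, the general fact behind this (ordinary multivariate normal sampling, nothing crate-specific): if the covariance factors as $C = A A^{\top}$ and $z \sim \mathcal{N}(0, I)$, then

$$x = m + \sigma A z \;\sim\; \mathcal{N}\!\left(m,\; \sigma^2 A A^{\top}\right) = \mathcal{N}(m, \sigma^2 C).$$

Stretching and shifting each component of $z$ independently corresponds to a diagonal $A$, which only reproduces the variances and drops the correlations between parameters that the covariance transform is there to encode.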
Also note that to perform operations on the GPU, the data must first be transferred to it, which probably adds considerable overhead. However, this overhead should become less significant relative to the operation speedup as the problem dimension increases, so you should compare benchmarks for N > ~100-300 to get an accurate idea of the benefits.
Your code for the mountain car problem works for me, but curiously it seems to produce decent solutions consistently in only a single generation. Further generations only improve the speed at which the goal is reached. So, I would consider this problem a success.
Also, I managed to get the pendulum problem working. Adjusting the fitness calculation makes a large difference here as well. I also turned off the `gene_deviations` calculations temporarily, as I thought it might lead to stagnation, but I have not validated this hypothesis yet. Here's the complete code for it:
It still takes quite some time to run (sometimes more than 50 generations), but it's fairly consistent at solving the problem:
Yes, I am sure the parallelism will help as problem size becomes large. I was getting really awful performance with my implementation (using the iGPU, unified memory); it was my first time trying something like this.
Are there any large benchmarks in the cmaes `benches`?

A `problem_dimension x population_size` (column x row) random matrix can be generated in parallel by arrayfire as a built-in (I think it happens on the GPU, unclear), and then I would like to perform many independent matrix-vector multiplications: each column of matrix `Z` (random data) independently matrix multiplied with all of matrix `T` (transform). I think you can actually do this with a single matrix multiply & some transposition, but I'm not sure how to get it to work.

Do you think it would be acceptable to replace all of the linear algebra representations in the crate with arrayfire stuff, assuming we can actually get a massive speedup in the GPU case? Arrayfire falls back to CPU implementations, supports OpenCL and CUDA, and has really broad compatibility. However, it requires that ArrayFire binaries be installed on the machine to function at all.
Ok, I think I have realized that no transpose is required and you can just do a single matrix multiplication. Does this look correct?
transformed
column vectors = covariance transform * random matrix
┌0┬─┬─┬─┬─┬─┬─┬─┐ 0─1─2─3─┬─┬─┬─┬─┬─┬─┬─┐ ┌0┬─┬─┬─┬─┬─┬─┬─┐
├1┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├1┼─┼─┼─┼─┼─┼─┼─┤
├2┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├2┼─┼─┼─┼─┼─┼─┼─┤
├3┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├3┼─┼─┼─┼─┼─┼─┼─┤
├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
▲ ├─┼─┼─┼─┼─┼─┼─┼─┤ ▲ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
│ ├─┼─┼─┼─┼─┼─┼─┼─┤ │ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
│ └─┴─┴─┴─┴─┴─┴─┴─┘ │ └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘ ▲ ├─┼─┼─┼─┼─┼─┼─┼─┤
BxP────▶ BxD────▶ │ ├─┼─┼─┼─┼─┼─┼─┼─┤
│ └─┴─┴─┴─┴─┴─┴─┴─┘
DxP────▶
The random matrix contains column vectors (individuals, each with D components), which are multiplied by the rows of the covariance transform matrix and end up occupying the corresponding column in the result, I think.

Do I have the dimensions right in the figure? That we end up with a BxP matrix? I am going off the fact that the `cov_transform` method claims it provides a `B x D` matrix, and I know I am supplying a `DxP` matrix (P is population, D is dimension, and I don't know what B is!).

This is what I have so far, although I don't think both it and the picture can be correct, as we ought to end up with vectors of length `D`, no?
```rust
/// Shared logic between `sample` and `sample_parallel`
fn sample_internal<
P: Fn(Vec<DVector<f64>>, &mut F) -> Result<Vec<EvaluatedPoint>, InvalidFunctionValueError>,
>(
&mut self,
state: &State,
mode: Mode,
_parallel_update: bool,
evaluate_points: P,
) -> Result<Vec<EvaluatedPoint>, InvalidFunctionValueError> {
//
// I found it very confusing at first, but have realized you can in fact accomplish
// the entire task of many independent column-vector-by-matrix multiplications in a
// single matrix multiplication:
//
// transformed
// column vectors = covariance transform * random matrix
// ┌0┬─┬─┬─┬─┬─┬─┬─┐ 0─1─2─3─┬─┬─┬─┬─┬─┬─┬─┐ ┌0┬─┬─┬─┬─┬─┬─┬─┐
// ├1┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├1┼─┼─┼─┼─┼─┼─┼─┤
// ├2┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├2┼─┼─┼─┼─┼─┼─┼─┤
// ├3┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├3┼─┼─┼─┼─┼─┼─┼─┤
// ├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// ├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// ├─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// ▲ ├─┼─┼─┼─┼─┼─┼─┼─┤ ▲ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// │ ├─┼─┼─┼─┼─┼─┼─┼─┤ │ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ ├─┼─┼─┼─┼─┼─┼─┼─┤
// │ └─┴─┴─┴─┴─┴─┴─┴─┘ │ └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘ ▲ ├─┼─┼─┼─┼─┼─┼─┼─┤
// BxP────▶ BxD────▶ │ ├─┼─┼─┼─┼─┼─┼─┼─┤
// │ └─┴─┴─┴─┴─┴─┴─┴─┘
// DxP────▶
//
//
// i.i.d normally-distributed matrix
let random_matrix = randn::<f32>(dim4!(self.dim as u64, self.population_size as u64));
// covariance transform matrix
let buffer = {
let ct = state.cov_transform();
let mut buff = F32_BUFFER_POOL.pull(|| Vec::with_capacity(ct.len()));
if buff.len() < ct.len() { buff.resize(ct.len(), 0.0); }
// copy into buffer (nalgebra promises it is column-major, so this is safe)
ct.as_slice().iter()
.zip(buff.iter_mut())
.for_each(|(source, target)| *target = *source as f32);
buff
};
let covariance_transform = Array::new(
&buffer[..],
dim4!(state.cov_transform().ncols() as u64, state.cov_transform().nrows() as u64)
);
// matrix, the columns of which are our transformed vectors.
let transformed_vectors = matmul(&covariance_transform, &random_matrix, MatProp::NONE, MatProp::NONE);
let tv_size = transformed_vectors.dims().elements() as usize;
let points = {
// copy the matrix back to host device
let mut local_buffer = F32_BUFFER_POOL.pull(|| Vec::with_capacity(tv_size));
if local_buffer.len() < tv_size { local_buffer.resize(tv_size, 0.0); }
transformed_vectors.host(&mut local_buffer[..]);
let mut local_buffer_f64 = F64_BUFFER_POOL.pull(|| Vec::with_capacity(tv_size));
local_buffer_f64.clear();
for x in local_buffer.iter() { local_buffer_f64.push(*x as f64); }
// convert the matrix to a list of vectors
let rows = transformed_vectors.dims().get()[1] as usize;
let mut vector_list = Vec::with_capacity(self.population_size);
for chunk in local_buffer_f64.chunks(rows) {
vector_list.push(DVector::from_column_slice(chunk));
}
vector_list
};
// Evaluate and rank points
let mut points = evaluate_points(points, &mut self.objective_function)?;
self.function_evals += points.len();
points.sort_by(|a, b| mode.sort_cmp(a.value, b.value));
Ok(points)
}
```
(All of the redundant copying / precision conversion stuff would eventually go away)
> Are there any large benchmarks in the cmaes `benches`?
There are not, but you can just add some with larger dimensions. I would recommend adding at the minimum an `N=300, lambda=512*default` benchmark in order to properly test the sampling efficiency.
> Do you think it would be acceptable to replace all of the linear algebra representations in the crate with arrayfire stuff, assuming we can actually get a massive speedup in the GPU case? Arrayfire falls back to CPU implementations, supports OpenCL and CUDA, and has really broad compatibility. However, it requires that ArrayFire binaries be installed on the machine to function at all.
For maximum efficiency this is probably necessary, but making it configurable through a feature somehow would be nice for compatibility (`cmaes` by default is pure Rust with no external dependencies).
> Ok, I think I have realized that no transpose is required and you can just do a single matrix multiplication. Does this look correct?
I see now, this does work with a single matrix multiplication. It seems you have the dimensions mixed up though; the covariance matrix is `N x N` where `N` is the problem dimension, and `B` and `D` simply refer to the eigenvector and eigenvalue matrices, respectively. But setting up an `N x lambda` random matrix (where `lambda` is the population size) and transforming it with `cov_transform` should be valid, as it produces another `N x lambda` matrix with properly transformed columns.
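In the standard notation (again, a general CMA-ES fact rather than anything crate-specific), that eigendecomposition is

$$C = B D^2 B^{\top},$$

with $B$ holding the eigenvectors of the $N \times N$ covariance matrix $C$ and $D$ the diagonal matrix of square-rooted eigenvalues, so the sampling transform $B D$ is itself $N \times N$.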
If you want to test your changes, `cargo test` will run a fixed seed test to make sure the results haven't changed. However, this requires that the random numbers are generated in the same way. You can just temporarily change the random number generation to match the old version when running the tests, then switch back once you're sure everything is correct.
EDIT: There is one more major matrix multiplication happening here. It's an `N x N` by `N x 1` and an `N x 1` by `1 x N` multiplication for each sampled point. This one doesn't seem as easy to simplify, but the whole step can probably be done on the GPU.
Okay, it will probably be possible to make everything feature-gated, but I think the best path to this will involve making the public types backend-agnostic, e.g. passing the sample point as `&[fxx]` rather than a `DVector`.
Will correct the issue with dimensions & add a new benchmark to validate the speedup.
Everything is working correctly, except for the `restart::tests::test_fixed_seed` test, which disagrees with expectation on the number of function evals:
```
thread 'restart::tests::test_fixed_seed' panicked at 'assertion failed: `(left == right)`
  left: `5790`,
 right: `5778`', src/restart/mod.rs:708:9
```
```rust
#[test]
fn test_fixed_seed() {
    let function = |x: &DVector<f64>| 1e-8 + (x[0] - 2.0).powi(2) + (x[1] - 1.0).powi(2);
    let strategy = RestartStrategy::Local(Local::new(10, None).unwrap());
    let seed = 96674803299116567;
    let results = RestartOptions::new(2, -1.0..=1.0, strategy)
        .seed(seed)
        .build()
        .unwrap()
        .run(|| function);

    assert_eq!(10, results.runs);
    assert_eq!(5790, results.function_evals); // <== trouble!
    assert_eq!(RestartTerminationReason::MaxRuns, results.reason);

    let best = results.best.unwrap();
    let eps = 1e-12;
    assert_approx_eq!(1.0000000002890303e-8, best.value, eps);
    assert_approx_eq!(2.00000000075140, best.point[0], eps);
    assert_approx_eq!(0.9999999989799438, best.point[1], eps);
}
```
IDK why this is happening, I have some checks in place:
```rust
// ...
    // convert the matrix to a list of vectors
    let mut vector_list = Vec::with_capacity(self.population_size);
    for chunk in local_buffer_f64.chunks_exact(self.dim) {
        vector_list.push(DVector::from_column_slice(chunk));
    }

    // ensure the number of vectors (`points.len()`) is correct, and that the buffer size
    // is what we expect.
    debug_assert_eq!(local_buffer_f64.len(), self.dim * self.population_size);
    debug_assert_eq!(vector_list.len(), self.population_size);

    vector_list
};

// Evaluate and rank points
let mut points = evaluate_points(points, &mut self.objective_function)?;
self.function_evals += points.len();
```
Is the hard-coded value expected by the test actually correct? Why would they differ otherwise? The function eval count I get is deterministic, but disagrees with the expected value.
Also, just so I'm clear, the eigendecomposition of a matrix is a process: `M => { eigenvalues, eigenvectors }`?
I'm fairly sure that `arrayfire` is not actually using the iGPU, and I have not found a way to convince it to do what I want.
The assertion is effectively checking that the same number of iterations is taken to terminate, not that the number of function evals per iteration is the same. So, if any change to the algorithm is made (even a small one), that change will cause different behavior each iteration that eventually adds up to a different number of iterations and therefore function evaluations before termination. If you were to comment that assertion out, the other ones should fail as well because everything is slightly off.
This fixed seed test is a bit more sensitive than the one in `tests/general.rs` due to running for more iterations. Does the `general.rs` one pass? If it does, your changes are most likely correct and are only causing issues with this test due to the sensitivity and minor precision differences.
Yes, you are correct about the eigendecomposition.
So a rough idea of the `sample_internal` speedup is now available.
For all tests: `nalgebra` is CPU, single thread; `arrayfire` is GPU, with the overhead of copying and floating point conversions.

- `nalgebra`: ~70us per run / `arrayfire`: ~800us per run
- `nalgebra`: ~1.6ms per run / `arrayfire`: ~1.8ms per run
- `nalgebra`: ~50ms per run / `arrayfire`: ~13ms per run
- `nalgebra`: 140.991s per run / `arrayfire`: 3.434s per run

These numbers look impressive! Are they the overall per-iteration time, or are you only measuring `Sampler::sample_internal`? Either way, the performance seems far better than the CPU version already, and I imagine it will improve further when all operations are converted.
They are inside one run of `sample_internal`, not the whole CMA-ES iteration, which is much longer than this step.
Next step is translating everything to arrayfire. This is actually very hard. A lot of the things built into nalgebra are not available in arrayfire.
Also, I'm pretty sure arrayfire is underperforming right now. I don't know why; my usage is pretty obvious. The reason I believe this is that gpu.rocks/#benchmark gets remarkably better performance for a 4096² matrix multiply using web tech (100ms). I bet the performance relies in part on crazy stuff the Chrome team has already done. I think my 4096² time is 470ms; I need to check again.
I have zero clue how well the current implementation would perform on something with more throughput, like a discrete GPU. I have an old 1050Ti, I'll check if CUDA works any better.
That's strange indeed. I'm not sure what could be causing that, but you could check the ArrayFire issue tracker to see if it's affecting anyone else at least.
The operations used in `cmaes` should be available in ArrayFire, even if they have to be done in a slightly different way. Was there anything in particular you needed that was missing?
I have zero clue why perf is worse than GPU.js; I think it might be an apples-to-oranges comparison. I have a MacBook, and Apple is very shamefully shunning OpenCL in favor of Metal. OpenCL is not getting patches or updates at all, and Metal perf can be 40% better. It could be that the Chromium team has a layer between web shaders (that GPU.js makes) and Metal. It could be an issue in arrayfire, idk. Haven't seen anything about it online.
Main gripe so far is that everything is very high level and there is no escape hatch other than to write an OpenCL + Cuda "shader" by hand if what you need is missing.
High level isn't the problem; the problem is that I'm not just doing linear algebra & random noise. I have 3-dimensional tensors that I need to perform operations on along a specific axis, etc. A lot of what I need can be accomplished in an unintuitive way, but I wish they'd provide a nice clear syntax for declaring tensor operations unambiguously & flexibly.
I just need to dive into every part of CMAES linear algebra usage & see what it's actually trying to accomplish, and find how to translate that & skip work we don't need to be doing. For instance, there's some parts that tamper with the internals of the covariance matrix, trying to make it symmetric about the diagonal IIRC. This may not be necessary anymore, as you can communicate matrix properties to arrayfire & it may be able to take a lower triangular matrix & just read upper element values from lower element memory etc. Stuff like that. I think the more context given, the more it can optimize, as it ships with a JIT "kernel fusion" thing that basically builds a queue of operations & then only executes them at the last possible second with maximum context, allowing it to dynamically merge kernels together and automatically reuse memory in a mostly optimal way.
Right now I'm working on transforming the `rank_mu_update` code & testing a hunch I have that can reduce the complexity a lot. Essentially, I think I can take all the `individuals.unscaled_step()` vectors together as a matrix, weight by a vector down the columns, sum element-wise across the rows, yielding a single vector, then take the matrix multiplication with its transpose to yield a matrix. Although, this may not actually be equivalent, and it may be necessary to assemble many `v x vT` matrices (yielding a very big 3D tensor), weight them all independently, then sum through the individuals axis. This would be a shame, as it's vastly more FLOPs, and the GPU parallelism runs out eventually & you must contend with the actual scaling complexity of the operation.
Regardless, it's not hard to parallelize the rank_mu step, even if it turns out it's not possible to simplify it.
Every single thing requires tons of copying though, and I think we should consider an implementation in which almost all data remains on the GPU, and modifications are informed by smaller pieces of data (objective function samples) that are sent back to the GPU, guiding the search.
Arrayfire has some interesting examples that got me thinking about this, specifically they have a genetic search algorithm that essentially lives entirely on the GPU. You can do sorting, selection, mutation, crossover, etc. Don't think we should take it that far, but if CMAES could be made almost fully GPU-native I'm sure it would be worth it.
Note: whenever I say GPU, substitute "compute device", arrayfire seamlessly transitions between CPU, GPU, and "accelerator hardware" (like, FPGA? Cool!) depending on availability.
I believe it is necessary to perform the rank-mu update as-is. If each vector is weighted and summed before the transpose and multiplication, the weight of the resulting matrix will effectively be the weight squared, as you would be multiplying `(weight * step) * (weight * step).transpose()` for each vector. This is avoided in the current code by weighting only the original vector and not the transpose, but I'm not sure if you can replicate this when summing first (using `weight.sqrt()` doesn't work here because some weights are negative).
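For what it's worth, the weighted sum of outer products in that update can also be written as a single product of ordinary 2-D matrices (a general linear algebra identity, not something specific to this crate), which sidesteps both the squared-weight problem and the 3D tensor:

$$\sum_i w_i\, y_i\, y_i^{\top} \;=\; Y\, \operatorname{diag}(w)\, Y^{\top}, \qquad Y = \begin{bmatrix} y_1 & \cdots & y_\lambda \end{bmatrix}.$$

Each weight appears exactly once, so negative weights are handled, and the cost is a diagonal scaling of an $N \times \lambda$ matrix followed by one $N \times \lambda$ by $\lambda \times N$ multiplication.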
I think it's worth considering a mostly GPU version of `cmaes`, even if it requires writing custom shaders. The overall flow of CMA-ES is:
So, one potential design that maximizes GPU usage would be:
This way there are only two transfers of an `N x lambda` matrix and one transfer of the state per iteration, and all linear algebra is done purely on the GPU (or other compute device if ArrayFire is used). This would be a lot more work than what you're doing now though, and I'm not sure how best to incorporate this kind of design into `cmaes`, so it can just be left as a future plan for now.
So I think I have everything worked out in terms of the rank_mu update, from a performance perspective. Tests pass, but I'm not sure if that guarantees all the numbers are correct inside.
The only expensive operations remaining are the buffer copies, which occupy ~97% of execution time (usually about 50ms for each buffer!), and will go away if we switch to storing data on the GPU.
I estimate an `N=128, lambda=128` single pass will end up taking something like 3.39 milliseconds once the buffer overhead is gone. I'm pretty sure there are better ways to write / structure it that would improve performance a lot more.
The eigendecomposition work is all still nalgebra / CPU.
There are some issues I've encountered though, which pertain to GPU usage.
The GPU has finite memory! I cannot run the `N=4096, lambda=4096` benchmark; I have only 1500 MB of video RAM, and my implementation accesses maximum parallelism at the expense of video memory usage. The memory usage explodes because I am using 3D tensors in places, e.g. `4 bytes * 4096 * 4096 * 14,000 = 939 GB`.
This means we need to actively break down the process into a sequence of maximum manageable chunks, at least for the steps that use 3D tensors that can grow insanely large.
Even with that issue resolved, as problem dimension grows, the memory requirements grow at least as large as max(D^2, lambda * D) because of the matrices used. If things could be refactored / reformalized somehow to avoid this, that would be ideal.
For the rank_mu portion, the matrices we create are always going to be symmetric across the diagonal, so in principle only the lower portion needs to be constructed. I do not know how to specify this through arrayfire.
I don't want to end up creating a system that works for N=4096 / N=16,000 problems but locks people out prematurely (they'll need an extremely expensive GPU with lots of memory). The fundamental space complexity requirements remain `O(n^2)`, unfortunately.
> Tests pass, but I'm not sure if that guarantees all the numbers are correct inside.
If the fixed seed tests are passing, your changes almost certainly have not affected the algorithm's behavior aside from negligible rounding/precision differences.
> The only expensive operations remaining are the buffer copies, which occupy ~97% of execution time (usually about 50ms for each buffer!), and will go away if we switch to storing data on the GPU.
The `N * lambda` population matrix will need to be copied at least once to the CPU in order to pass the vectors to the objective function, and probably back to the GPU as well after sorting. So, there will still be some overhead to all this, but it should become less significant as `N` increases (and it will depend on your GPU).
> The GPU has finite memory! I cannot run the `N=4096, lambda=4096` benchmark; I have only 1500 MB of video RAM, and my implementation accesses maximum parallelism at the expense of video memory usage. The memory usage explodes because I am using 3D tensors in places, e.g. `4 bytes * 4096 * 4096 * 14,000 = 939 GB`.
I wouldn't expect memory to be an issue at this scale, as `4 bytes * 4096 * 4096` is only ~67 MB. What is the `14,000` for in your example? If it does turn out to be an issue though, the operation can freely be split up into any number of `N * L` matrices, where all `L` values sum to `lambda`.
> If the fixed seed tests are passing, your changes almost certainly have not affected the algorithm's behavior aside from negligible rounding/precision differences.
I think this is the case.
> The `N * lambda` population matrix will need to be copied at least once to the CPU in order to pass the vectors to the objective function, and probably back to the GPU as well after sorting. So, there will still be some overhead to all this, but it should become less significant as `N` increases (and it will depend on your GPU).
Yes, the `N * lambda` copy is required, but I think only a `lambda`-length vector needs to be sent back to the GPU, which can function as a sorting guide for the columns of the matrix. The GPU can sort the matrix itself.
> I wouldn't expect memory to be an issue at this scale, as `4 bytes * 4096 * 4096` is only ~67 MB. What is the 14,000 for in your example? If it does turn out to be an issue though, the operation can freely be split up into any number of `N * L` matrices, where all `L` values sum to `lambda`.
This is the plan, although finding the maximum manageable stack size is not really possible, as I can't get information about the device memory size; the C/C++ API provides this, but the Rust bindings seem to have forgotten it. The 14,000 comes from `N=4096, lambda=4096` default: you have 4096 default matrices, each 4096x4096, each float 4 bytes, and it gets big in the rank_mu step. Previously, this is what I was testing for performance, as in the earlier step (transform matrix etc.) you only get an `N * lambda` matrix, which is ~67 MB as you say.
Oh, I see now. The two main options I can think of are either:
I think the second option is better in most cases, and switching to the first shouldn't be too difficult if this doesn't work out.
Yeah, I'd like to use a hybrid approach wherein we still leverage matrix-level parallelism AND many-matrix parallelism, but never overflow memory by chunking up the work.
However, this is really hard to do efficiently in a multi-thread environment in which many CMAES optimization passes are running concurrently.
I think for now (and without ability to inspect device memory size!) I'll leave it as-is.
Turns out there were a few bugs in my original implementation that are resolved now.
Will try to tackle eigendecomposition next. Arrayfire provides a QR decomposition algorithm built in, and I understand how to use that to get eigenvalues and eigenvectors (I think).
What I don't know exactly how to do is assess convergence of the decomposition. A lot of the code I've read just blindly iterates some number of times, whereas the original `nalgebra` function dynamically examines convergence & compares it to a tolerance parameter. It's unclear exactly how this is being done.
There are also many different flavors of eigendecomposition, and I'm not sure what properties QR has / lacks.
Any input on this?
Here is a visualization I stumbled on today:
https://user-images.githubusercontent.com/2095902/174763068-4a881f06-b1a8-42f3-ae08-b7050a09f191.mp4
The left / right matrices are random and symmetric respectively, I think.
When we talk about decomposition convergence, does this refer to the magnitude of the non-diagonal components? It's difficult to find straight answers about this on the web.
Also, I'm thinking it may make sense to provide mutually exclusive implementations for subcomponents of CMAES that are then conditionally compiled by feature flags. The changes required for a good version of the arrayfire implementation are far-reaching, e.g. changes to lots of different structs to store a pointer to GPU memory. Your existing implementation will remain the default (if you want) and be guarded by `#[cfg(not(feature = "arrayfire"))]`, and the arrayfire implementation will exist behind `#[cfg(feature = "arrayfire")]`. We can then mirror this flag setup in `eant2`.

In order to hide the implementation completely, the CMAES public API should no longer make use of `nalgebra` constructs like `DVector`, hopefully opting for `&[f32]` or `&[f64]`, or even `Reusable<Vec<f32>>`, etc.
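A minimal sketch of what that gating might look like (the feature and module names here are hypothetical, not the actual crate layout):

```rust
// Cargo.toml (sketch):
//
// [features]
// default = []
// arrayfire = ["dep:arrayfire"]

// In lib.rs (sketch): select a linear-algebra backend at compile time and keep
// the rest of the crate coded against the `backend` module's interface.
#[cfg(not(feature = "arrayfire"))]
mod backend {
    // Pure-Rust, nalgebra-based implementation (the current default behavior).
}

#[cfg(feature = "arrayfire")]
mod backend {
    // ArrayFire-based implementation; requires the ArrayFire binaries at runtime.
}
```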
Another option, hypothetically, is to dynamically choose between a backend based on problem size. The nalgebra implementation is a lot faster than arrayfire for small sizes. The optimal threshold to switch to arrayfire will depend massively on the hardware: CPU, GPU(s), cache sizes + memory hierarchy, memory speed, OpenCL version, data type, etc etc etc.
I was thinking (about something I don't want to build IRL) of a persistent store that assumes a partition on problem sizes: a single critical point beyond which arrayfire is always better for performance. It would measure this point at first execution, so we know the optimal switch point, tuned for the hardware, for all future runs.
I'm not familiar with QR decompositions unfortunately. From what I've read, it seems that convergence has something to do with the magnitudes of non-diagonal elements, but I'm not sure exactly what the relationship is.
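From what I gathered, the basic textbook scheme looks something like the sketch below (unshifted QR iteration on a symmetric matrix, with convergence judged by the off-diagonal magnitude). This is only my reading of it, not what `nalgebra` does verbatim; real implementations add shifts and a tridiagonal reduction first:

```rust
use nalgebra::DMatrix;

/// Textbook (unshifted) QR iteration for a symmetric matrix: a sketch only.
/// Returns (matrix with eigenvalues on its diagonal, accumulated eigenvectors
/// as columns).
fn qr_eigen_sketch(mut a: DMatrix<f64>, tol: f64, max_iters: usize) -> (DMatrix<f64>, DMatrix<f64>) {
    let n = a.nrows();
    let mut vectors = DMatrix::<f64>::identity(n, n);
    for _ in 0..max_iters {
        let qr = a.clone().qr();
        let (q, r) = (qr.q(), qr.r());
        a = &r * &q;              // similarity transform: eigenvalues are preserved
        vectors = &vectors * &q;  // accumulate the orthogonal factors
        // "Convergence" here means the off-diagonal mass shrinking toward zero.
        let off_diag_sq: f64 = (0..n)
            .flat_map(|i| (0..n).map(move |j| (i, j)))
            .filter(|&(i, j)| i != j)
            .map(|(i, j)| a[(i, j)] * a[(i, j)])
            .sum();
        if off_diag_sq.sqrt() < tol {
            break;
        }
    }
    (a, vectors)
}
```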
At some point this will probably be the best solution, as very little would actually be shared between the existing code and the ArrayFire code. Automatically determining which backend to use is probably outside the scope of the crate, but the provided benchmarks should be enough to help users determine which to use for their particular problem.
Hello!
I'm very excited about using this library; however, the README claims it is in a broken state, waiting for fixes in the CMA-ES repo. I see the CMA-ES repo is more active, with many commits recently.
Is the README still accurate in saying this crate is broken?
Amazing work, thank you so much!