pengowen123 / cge

An implementation of the CGE neural network encoding, written in Rust.
Apache License 2.0

CGE encoding / topology question #3

Closed wbrickner closed 2 years ago

wbrickner commented 2 years ago

Hello!

I'm trying to write a new CGE library with different goals, but first I must understand the evaluation of CGE networks!

I am a little stuck in understanding how the topology of a CGE network is encoded implicitly.

Is there some way I can extract a 'topological ordering' of the network, separating it into "layers" of computation that must occur?

Thank you.

wbrickner commented 2 years ago

Equivalently, is there any process by which I can convert a CGE encoding to a more abstract representation, like a petgraph? My big problem is that I can't understand how CGE actually works.


My goal is to analyze all the data dependencies and create an optimized evaluation function at compile time (all inside a proc macro). This cuts down on size and runtime overhead for network evaluations; the ripest targets are embedded and WASM, but it works in std environments as well.

The goal is to be able to say something along the lines of:

let mut network = cge!("./my_little_network.json");
network.evaluate(&inputs, &mut outputs);

Which expands out to something like:

struct Network {
   /// analysis determined there are only 2 neurons whose output is used recurrently, so only 2 persistent floats are needed.
   persistent: [f32; 2]
}
impl Network {
  /// analysis determined there are 6 inputs, 2 outputs.
  fn evaluate(&mut self, inputs: &[f32; 6], outputs: &mut [f32; 2]) {
    let n0 = (0.123 * inputs[0]) + (0.33 * inputs[1]);
    let n1 = (0.770 * inputs[0]) + (0.40 * inputs[3]);
    // ...
    outputs[0] = (0.64 * n9) + (0.22 * n2) - 0.355;
    outputs[1] = (0.22 * n9) + (0.95 * n2) + 0.889;
    // ...
  }
}

This puts all the hard work on LLVM.

pengowen123 commented 2 years ago

The genome of a network is laid out linearly and can be evaluated without considering the topology by using the rules of Reverse Polish Notation. You traverse the genome back to front, adding Input genes onto the stack and popping however many are needed when encountering a Neuron gene, whose output then gets put onto the stack. Forward jumps (effectively links to the output of other neurons) first evaluate the subnetwork they refer to and then place the result onto the stack. Recurrent jumps retrieve the value of the pointed-to neuron from the previous evaluation and place it onto the stack.
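The procedure above can be sketched in plain Rust. Everything here is illustrative, not this crate's actual API: the `Gene` type and function names are assumptions, the activation is linear, and the forward-jumper branch uses the naive strategy of re-evaluating the referenced subnetwork each time.

```rust
// Illustrative gene representation and evaluator; these names are
// assumptions for the sketch, not this crate's actual API.
#[derive(Clone, Copy)]
enum Gene {
    Neuron { weight: f64, id: usize, inputs: usize },
    Input { weight: f64, index: usize },
    ForwardJumper { weight: f64, source: usize },
    RecurrentJumper { weight: f64, source: usize },
    Bias { weight: f64 },
}

fn activation(x: f64) -> f64 {
    x // linear, purely for illustration
}

/// Length of the subnetwork rooted at `start`: walk forward until every
/// neuron in the slice has accounted for its declared number of inputs.
fn subnetwork_len(genome: &[Gene], start: usize) -> usize {
    let mut need: isize = 1;
    let mut end = start;
    loop {
        need -= 1;
        if let Gene::Neuron { inputs, .. } = genome[end] {
            need += inputs as isize;
        }
        if need == 0 {
            return end - start + 1;
        }
        end += 1;
    }
}

/// RPN-style evaluation: traverse the genome back to front, pushing one
/// value per terminal gene and popping `inputs` values per neuron gene.
/// `previous` holds each neuron's output from the previous evaluation.
fn evaluate(genome: &[Gene], inputs: &[f64], previous: &[f64]) -> Vec<f64> {
    let mut stack = Vec::new();
    for gene in genome.iter().rev() {
        match *gene {
            Gene::Input { weight, index } => stack.push(weight * inputs[index]),
            Gene::Bias { weight } => stack.push(weight),
            Gene::RecurrentJumper { weight, source } => {
                // Recurrent jumpers read the pointed-to neuron's previous output.
                stack.push(weight * previous[source]);
            }
            Gene::ForwardJumper { weight, source } => {
                // Naive strategy: re-evaluate the subnetwork rooted at the
                // source neuron, with its own weight replaced by 1.0 so the
                // jumper sees the raw activation.
                let start = genome
                    .iter()
                    .position(|g| matches!(*g, Gene::Neuron { id, .. } if id == source))
                    .expect("forward jumper must point at an existing neuron");
                let mut sub = genome[start..start + subnetwork_len(genome, start)].to_vec();
                if let Gene::Neuron { weight, .. } = &mut sub[0] {
                    *weight = 1.0;
                }
                stack.push(weight * evaluate(&sub, inputs, previous)[0]);
            }
            Gene::Neuron { weight, inputs: n, .. } => {
                let sum: f64 = stack.split_off(stack.len() - n).iter().sum();
                stack.push(weight * activation(sum));
            }
        }
    }
    stack
}

fn main() {
    // Toy genome: neuron 0 sums neuron 1's weighted output and a forward
    // jumper to neuron 1's raw activation.
    let genome = [
        Gene::Neuron { weight: 1.0, id: 0, inputs: 2 },
        Gene::ForwardJumper { weight: 0.5, source: 1 },
        Gene::Neuron { weight: 2.0, id: 1, inputs: 1 },
        Gene::Input { weight: 1.0, index: 0 },
    ];
    // 2.0 * 2.0 (neuron 1, weighted) + 0.5 * 2.0 (jumper) = 5.0
    assert_eq!(evaluate(&genome, &[2.0], &[0.0, 0.0]), vec![5.0]);
}
```

Any values left on the stack after the full traversal are the network's outputs.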

A full description of the CGE encoding can be found in this paper.

Ad-hoc code generation could be good for performance, but it is limited to networks known at compile time. Run-time construction of networks (as used in eant2) must be supported as well. Another thing to consider is how many networks are used in this way, as if too many are constructed then the amount of code generated could hurt caching and make it slower overall. Still, it would be worth investigating at some point because this library could conceivably be used in a setting where performance matters.

wbrickner commented 2 years ago

Yeah, good points: it may not cover all use cases, and if you have 50 models it will not work well, but I would like to run EANT2 networks on my new RP Pico board! On something like an Arduino or my RP2040, the CPU is (AFAICT) almost 1970s-style: no pipeline, no memory hierarchy, etc.

Often, DIY projects that utilize machine learning need to contend with TFLite bindings, model compression, tons of Python glue, and so on. The story here is: you've got one or two tasks, one or two networks, you want them to run /fast/ on insanely constrained hardware, and you want to do it in 2 lines of code & never think about how it works. There are good demonstrations of EANT2 networks in recurrent control tasks and non-recurrent estimation tasks, which are usually what you'd want to do on embedded hardware.


So, a few things I don't understand about this process:

Thank you for your guidance, I hope I am able to get there!

pengowen123 commented 2 years ago

That sounds interesting! I'd be curious to hear what you plan on doing with it.

For the use cases you listed, some form of ad-hoc model generation would be a nice feature to implement (something along the lines of your second comment). However, it would have to be in addition to the more usual, dynamic approach in order to support other use cases. I'm not sure how best to avoid code duplication here.

> How do shared inputs work? The output of a neuron can go to many others, but as soon as you encounter a neuron that consumes it, you take the output value off the stack, making it inaccessible to other neurons that may need it.

For shared inputs, that's the purpose of the forward jumper genes. They take the output of a neuron and place it on the stack again, and this mechanism allows neuron outputs to be used more than once.

> What does it mean to evaluate a subnetwork? Why is it done this way? I see notes in the repo about inefficiencies from double-evaluation, and this was one of the parts that really confused me.

Subnetworks refer to the inputs to a neuron, and the inputs to those inputs if they are neurons themselves, and so on. To take the output of a neuron is equivalent to (and sometimes requires) evaluating its corresponding subnetwork. I believe this is currently implemented in a naive way that simply re-evaluates a subnetwork each time a forward jumper needs its value, but this can be avoided by caching the output on first evaluation and reusing it afterwards.
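The caching idea can be sketched as a small helper (hypothetical names, not this crate's API): memoize each neuron's activation in a map keyed by neuron ID, and let forward jumpers consult the map before re-evaluating.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not this crate's API): memoize each neuron's
/// activation the first time its subnetwork is evaluated, so forward
/// jumpers reuse it instead of re-evaluating. The cache is only valid
/// within a single evaluation and must be cleared between evaluations,
/// since o(t) values from one pass are o(t - 1) values in the next.
fn neuron_output(
    id: usize,
    cache: &mut HashMap<usize, f64>,
    evaluate_subnetwork: impl FnOnce() -> f64,
) -> f64 {
    if let Some(&out) = cache.get(&id) {
        return out; // reuse the first evaluation's result
    }
    let out = evaluate_subnetwork();
    cache.insert(id, out);
    out
}

fn main() {
    let mut cache = HashMap::new();
    // First use evaluates the subnetwork...
    let first = neuron_output(3, &mut cache, || 0.25);
    // ...later forward jumpers hit the cache instead of re-evaluating.
    let second = neuron_output(3, &mut cache, || unreachable!());
    assert_eq!(first, second);
}
```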

> Are the GeneExtras::Input variants /only/ what I think of as "network inputs", or are they used in other ways? The paper makes mention of some other use I think.

IIRC yes, they are only the inputs to the network. The paper also uses the term for the inputs to a neuron though, regardless of whether they are Inputs or other kinds of genes.

wbrickner commented 2 years ago

So I've decided to take a shortcut (sidestepping my lack of thorough understanding of CGE).

I'm just going to execute your magic evaluation function, and anywhere there's a computation I'll append it to a list of TokenStreams.

Hopefully, if the ordering is right, it'll all work correctly, and LLVM will take care of any redundant or pointless code that gets generated.

This would all be trivial (and correctness would be obvious) if I could get a digraph representation of the network. Would you be able to help me piece that together?

pengowen123 commented 2 years ago

I should be able to take a look at refactoring the evaluation code in the next few days if you just want to implement the code generation part.

I would probably start with something like this for building a digraph of a network:

  1. Scan the genome and add any neurons and inputs with unique IDs as vertices.
  2. Do a dummy evaluation of the network that only increments/decrements a tally of how tall the stack is after each gene, recording the final height of the stack. Add a number of output vertices equal to this final value.
  3. Run through the genome again, this time recording the input/neuron ID of every value on the stack. When encountering a neuron, for each stack value popped, add an edge between the current neuron and the input/neuron referred to by the value's ID (different IDs should be used for inputs and neurons).
  4. For each value on the previous evaluation's stack, add an edge between the corresponding output vertex and the input/neuron with the value's ID.

Note that forward and recurrent jumpers are handled slightly differently during evaluation, with forward jumpers taking a neuron's current value and recurrent jumpers taking its previous value. Also, the output vertices aren't strictly necessary, but if you use them then step 2 can probably be incorporated into step 3.
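The steps above can be sketched by simulating evaluation with vertex IDs in place of values. As before, the `Gene` and `Vertex` types are assumptions for illustration (a real version might emit a petgraph and label recurrent edges distinctly), but the stack bookkeeping mirrors the described algorithm.

```rust
use std::collections::HashSet;

// Illustrative types for the sketch; not this crate's actual API.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Vertex {
    Neuron(usize),
    Input(usize),
    Output(usize),
}

#[derive(Clone, Copy)]
enum Gene {
    Neuron { id: usize, inputs: usize },
    Input { index: usize },
    ForwardJumper { source: usize },
    RecurrentJumper { source: usize },
    Bias,
}

/// Build a digraph by simulating evaluation: the stack records which
/// neuron/input produced each entry rather than its numeric value.
fn digraph(genome: &[Gene]) -> (HashSet<Vertex>, Vec<(Vertex, Vertex)>) {
    let mut vertices = HashSet::new();
    let mut edges = Vec::new();
    let mut stack: Vec<Option<Vertex>> = Vec::new();
    for gene in genome.iter().rev() {
        match *gene {
            Gene::Input { index } => {
                vertices.insert(Vertex::Input(index));
                stack.push(Some(Vertex::Input(index)));
            }
            // Bias genes contribute a constant, not an edge.
            Gene::Bias => stack.push(None),
            // Both jumper kinds point at a neuron; a fuller version would
            // label recurrent edges differently from forward ones.
            Gene::ForwardJumper { source } | Gene::RecurrentJumper { source } => {
                stack.push(Some(Vertex::Neuron(source)));
            }
            Gene::Neuron { id, inputs } => {
                vertices.insert(Vertex::Neuron(id));
                for src in stack.split_off(stack.len() - inputs).into_iter().flatten() {
                    edges.push((src, Vertex::Neuron(id)));
                }
                stack.push(Some(Vertex::Neuron(id)));
            }
        }
    }
    // Whatever remains on the stack feeds the network outputs (step 4).
    for (i, src) in stack.into_iter().enumerate() {
        vertices.insert(Vertex::Output(i));
        if let Some(src) = src {
            edges.push((src, Vertex::Output(i)));
        }
    }
    (vertices, edges)
}

fn main() {
    // One neuron fed by two inputs.
    let genome = [
        Gene::Neuron { id: 0, inputs: 2 },
        Gene::Input { index: 0 },
        Gene::Input { index: 1 },
    ];
    let (vertices, edges) = digraph(&genome);
    assert_eq!(vertices.len(), 4); // two inputs, one neuron, one output
    assert!(edges.contains(&(Vertex::Input(0), Vertex::Neuron(0))));
    assert!(edges.contains(&(Vertex::Neuron(0), Vertex::Output(0))));
}
```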

wbrickner commented 2 years ago

So I've gone ahead with my shortcut, and it seems to be working out well so far. One problem: I can't tell whether it's correct. Does this output look correct to you?

CGE File

Lifted from this repository:

1: n 1 0 3,n 1 1 2,n 1 3 2,i 1 0,i 1 1,i 1 1,n 1 2 4,f 1 3,i 1 0,i 1 1,r 1 0,b 1
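For reference, the gene string above can be parsed with a small sketch like this. The format is inferred from the file contents ("n <weight> <id> <num_inputs>", "i <weight> <input_index>", "f <weight> <source_id>", "r <weight> <source_id>", "b <weight>"); the `Gene` type is illustrative, and the leading "1:" is assumed to be a metadata prefix stripped before parsing.

```rust
// Illustrative gene type; not this crate's actual API.
#[derive(Debug, PartialEq)]
enum Gene {
    Neuron { weight: f64, id: usize, inputs: usize },
    Input { weight: f64, index: usize },
    ForwardJumper { weight: f64, source: usize },
    RecurrentJumper { weight: f64, source: usize },
    Bias { weight: f64 },
}

/// Parse a comma-separated gene string; returns None on any malformed gene.
fn parse_genome(s: &str) -> Option<Vec<Gene>> {
    s.split(',')
        .map(|gene| {
            let mut parts = gene.trim().split_whitespace();
            let kind = parts.next()?;
            let weight: f64 = parts.next()?.parse().ok()?;
            Some(match kind {
                "n" => Gene::Neuron {
                    weight,
                    id: parts.next()?.parse().ok()?,
                    inputs: parts.next()?.parse().ok()?,
                },
                "i" => Gene::Input { weight, index: parts.next()?.parse().ok()? },
                "f" => Gene::ForwardJumper { weight, source: parts.next()?.parse().ok()? },
                "r" => Gene::RecurrentJumper { weight, source: parts.next()?.parse().ok()? },
                "b" => Gene::Bias { weight },
                _ => return None,
            })
        })
        .collect()
}

fn main() {
    // The genome from this comment, minus the leading "1:" prefix.
    let genome = parse_genome(
        "n 1 0 3,n 1 1 2,n 1 3 2,i 1 0,i 1 1,i 1 1,n 1 2 4,f 1 3,i 1 0,i 1 1,r 1 0,b 1",
    )
    .expect("genome should parse");
    assert_eq!(genome.len(), 12);
    assert_eq!(genome[0], Gene::Neuron { weight: 1.0, id: 0, inputs: 3 });
}
```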

Codegen (take two)

pub fn evaluate(&mut self, inputs: &[f64; 5usize], outputs: &mut [f64]) {
  let c0 = 1f64;
  let c1 = 1f64 * self.persistence[0usize];
  let c2 = 1f64 * inputs[1usize];
  let c3 = 1f64 * inputs[0usize];
  let c4 = 1f64 * inputs[1usize];
  let c5 = 1f64 * inputs[0usize];
  let c6 = c5 + c4;
  let c6 = Self::activation(c6);
  let c7 = c6 * 1f64;
  let c8 = c7 + c3 + c2 + c1;
  let c8 = Self::activation(c8);
  let c8 = c8 * 1f64;
  let c9 = 1f64 * inputs[1usize];
  let c10 = 1f64 * inputs[1usize];
  let c11 = 1f64 * inputs[0usize];
  let c12 = c11 + c10;
  let c12 = Self::activation(c12);
  let c12 = c12 * 1f64;
  let c13 = c12 + c9;
  let c13 = Self::activation(c13);
  let c13 = c13 * 1f64;
  let c14 = c13 + c8 + c0;
  let c14 = Self::activation(c14);
  self.persistence[0usize] = c14;
  let c14 = c14 * 1f64;
  outputs[0usize] = c14;
}
pengowen123 commented 2 years ago

That looks correct to me. I think there's a way to avoid duplicating subnetwork evaluations, but I would guess that LLVM can do a pretty good job optimizing it anyways. We can compare the speeds with benchmarks later to see for sure.

Edit: All updates to self.persistence must be done at the end of evaluation to avoid recurrent jumpers possibly reading the updated value instead of the previous one. Now that I think about it, this might be a bug in the original code as well.

wbrickner commented 2 years ago

I've only followed the execution path of the original function.

I set self.persistence[i] if neuron_update is set to true, like:

if neuron_update {
  if let Some(index) = recurrence_table.get(&neuron_id) {
    // if we're told to update state, and if the neuron is recurrent, update the persistence array
    computations.push(quote! {
      self.persistence[#index] = #result_id;
    });
  }
}

where recurrence_table maps a neuron ID to a compact index value.

When handling Recurrent jumper gene:

computations.push(quote! {
  let #result_id = #weight * self.persistence[#persistence_index]; // access persistence, apply weighting
});

Not sure if this is correct; CGE is still confusing to me.

pengowen123 commented 2 years ago

The behavior from the original code does seem to have a bug in it, but I will fix it in my refactoring. To fix your version you can just move the self.persistence[i] updates to the very end of the function. Assuming you are implementing this recursively, the neuron_update check should be kept so that the updates only take place in the top-level evaluate call and not any sub-calls.

wbrickner commented 2 years ago

I will not pretend to understand the fix, but I will gladly duplicate it locally :) +Watched

wbrickner commented 2 years ago

After thinking about this a little more, I think it doesn't make sense to update self.persistence[i] at the end. self.persistence[i] is the neuron activation of some recurrent neuron, the value of which is sent backwards. When you reach that neuron, you update the value. Seems safe?

pengowen123 commented 2 years ago

That's a reasonable interpretation of it and probably what I originally thought as well, but I see now that the paper uses the notation o(t) and o(t - 1) to denote the output of a neuron at different times (i.e., evaluations). This seems to imply that the previous set of outputs o(t - 1) is entirely distinct from the current set o(t).

pengowen123 commented 2 years ago

The evaluation bug is fixed now. I'll work on another PR that refactors most of the code, as the current Gene struct is very messy and bug-prone.

wbrickner commented 2 years ago

I think you may find this interesting.

This leads me to believe either I was correct all along or your new patch didn't actually change behavior.

So either:

Let me know what you think, I'm waiting for the core CGE logic argument to be resolved before I will publish.

pengowen123 commented 2 years ago

The problem is only visible in the specific case where a recurrent jumper refers to a neuron later in the genome, which means that the jumper will be evaluated after the neuron and incorrectly use its current value. Can you try running your code through this new test and seeing if it works? This test fails for me on the previous commit but is fixed by the latest one.

wbrickner commented 2 years ago

Yes, fails for me, will see if I can get it to pass / integrate the other test cases

wbrickner commented 2 years ago

Delaying all mutations of self.persistence to the end in my implementation passes all tests (mine + the specific one you mentioned). I have yet to add every test from cge.

Is this approach well founded? Like, /should/ this be working?

pengowen123 commented 2 years ago

Yes, this does seem correct. The explicit function for the current output of a neuron given on page 1032 of the paper defines the current output of a neuron to be o(t), and distinguishes this from the o(t - 1) used to refer to the neuron's previous output. So, any calculation done for a neuron during evaluation falls under the o(t) category, while o(t - 1) refers exclusively to the outputs from the previous evaluation. Therefore, to keep the two separate, o(t - 1), which is self.persistence in your code, must not be overwritten during evaluation; it can only be updated at the very end.
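A minimal sketch of why the o(t) / o(t - 1) separation matters. The two-neuron network, names, and weights here are entirely made up for illustration: neuron 0 is evaluated first, and neuron 1 holds a recurrent jumper pointing at it.

```rust
/// Toy step function for a two-neuron network where neuron 1 has a
/// recurrent jumper to neuron 0, and neuron 0 is evaluated first.
/// `update_in_place` models the buggy behavior of writing o(t) straight
/// into the persistence slot during evaluation.
fn step(persistence: &[f64; 2], update_in_place: bool) -> [f64; 2] {
    let mut state = *persistence;
    // Neuron 0 fires first.
    let n0 = persistence[1] + 1.0;
    if update_in_place {
        state[0] = n0; // buggy: overwrites o(t - 1) mid-evaluation
    }
    // Neuron 1's recurrent jumper must read o(t - 1) of neuron 0, but an
    // in-place update makes it see o(t) instead.
    let n1 = state[0] * 2.0;
    [n0, n1]
}

fn main() {
    let prev = [3.0, 0.0];
    // Deferred updates: the jumper reads the previous value, 3.0.
    assert_eq!(step(&prev, false), [1.0, 6.0]);
    // In-place updates: the jumper reads the fresh value, 1.0. Wrong.
    assert_eq!(step(&prev, true), [1.0, 2.0]);
}
```

Deferring all persistence writes to the end of the evaluation is what keeps the second case from happening.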

wbrickner commented 2 years ago

So after adding more genomes to test on, I've found that our implementations differ in results for the Figure 3.5 network and the "Figure 3.5 plus one". They're within 1% of each other, but I think they should actually be a bitwise match.

pengowen123 commented 2 years ago

That's strange. Can you share the full code generated for the two networks?

wbrickner commented 2 years ago

Notes:

Network

97-Figure5 3-1

CGE file contents:

0: n 0.6 0 2,n 0.8 1 2,n 0.9 3 2,i 0.1 0,i 0.4 1,i 0.5 1,n 0.2 2 4,f 0.3 3,i 0.7 0,i 0.8 1,r 0.2 0

Generated Code

pub fn evaluate(&mut self, inputs: &[f64; 5usize], outputs: &mut [f64; 1usize]) {
  let c0 = 0.2f64 * self.persistence[0usize];
  let c1 = 0.8f64 * inputs[1usize];
  let c2 = 0.7f64 * inputs[0usize];
  let c3 = 0.4f64 * inputs[1usize];
  let c4 = 0.1f64 * inputs[0usize];
  let c5 = c4 + c3;
  let c5 = const_cge::activations::f64::linear(c5);
  let c6 = c5 * 0.3f64;
  let c7 = c6 + c2 + c1 + c0;
  let c7 = const_cge::activations::f64::linear(c7);
  let c7 = c7 * 0.2f64;
  let c8 = 0.5f64 * inputs[1usize];
  let c9 = 0.4f64 * inputs[1usize];
  let c10 = 0.1f64 * inputs[0usize];
  let c11 = c10 + c9;
  let c11 = const_cge::activations::f64::linear(c11);
  let c11 = c11 * 0.9f64;
  let c12 = c11 + c8;
  let c12 = const_cge::activations::f64::linear(c12);
  let c12 = c12 * 0.8f64;
  let c13 = c12 + c7;
  let c13 = const_cge::activations::f64::linear(c13);
  let c13 = c13 * 0.6f64;
  outputs[0usize] = c13;
  self.persistence[0usize] = c13;
}

You can see that the network is repeating computations like you mentioned, but it is pretty innocuous and LLVM will surely notice it. If not, I can track reusable results in a greedy manner.

Output

thread 'tests::figure_5_3_paper::recurrent_5k_cycles_1k_trials' panicked at 'assertion failed: `(left == right)`
  left: `[0.32982528903255776]`,
 right: `[0.3306484728843932]`', src/tests.rs:165:9
wbrickner commented 2 years ago

So I think I've found the bug.

Notice that self.persistence[0] receives the weighted value rather than the raw activation, as it should. This is a side effect of reordering the persistence updates to the end.

wbrickner commented 2 years ago

Issue resolved (retain the intermediate value and use that identifier at the end for updating persistence).

Do you have any other networks you've got that I can try? Is it possible to assemble thousands of random networks procedurally to test? I think I recall one of the desirable properties of CGE is that

Thanks for all your help!

pengowen123 commented 2 years ago

Nice catch!

The only test networks I have were the ones you already found in the code. However, I plan on adding the basic mutation operators directly to this crate after I refactor some of it so that correctness can be guaranteed. Then some randomized network tests can be added easily. It is possible to construct an invalid CGE genome, so random mutations must be done carefully.

You can go ahead and open the PR if you feel the code is correct. More tests can be added later.

wbrickner commented 2 years ago

I have it living as a repo for now, will publish to crates soon as a separate crate: https://github.com/wbrickner/const_cge

I'm envisioning an ecosystem for (a little unconventional) machine learning (embedded!) in Rust that centers on eant2 and const_cge (like how async programming has Tokio + Hyper + Tower + Tonic + Mio, etc.), and I designed us some fancy graphics.

If you want to use them, pls do, it'll make us match!

wbrickner commented 2 years ago

Also, it turns out I can't publish any crates that depend on cge, as it's only distributed via git. Would you mind publishing a v0.0.0 of cge? I know you want to rewrite the internals, so your 0.1.0 can be pushed later on.

pengowen123 commented 2 years ago

Alright, v0.0.1 should be live.

const_cge looks very nice! I'll add the graphics in when I clean things up a bit. As for having a small ecosystem, I think that will work well. With eant2 fixed it should be easy to train neural networks and load them up with either cge or const_cge. I'll go ahead and remove cge_core now as const_cge already implements its vision.

pengowen123 commented 2 years ago

One small thing I noticed is the number of inputs for the figure 3.5 network seems to be 5, but it should be 2. The number of inputs to a network is the number of unique input IDs found among all Input genes in the genome.

wbrickner commented 2 years ago

Uh oh... Yanking time. Let me check it out.

let input_count = {
  let mut input_ids = HashSet::new();
  network
    .genome
    .iter()
    .filter_map(|g| match g.variant {
      GeneExtras::Input(_) => Some(g.id),
      _ => None
    })
    .for_each(|id| { input_ids.insert(id); });

  input_ids.len()
};

Does this new code look like a correct way to count inputs?

pengowen123 commented 2 years ago

Yeah, that looks right.