nengo / nengo-loihi

Run Nengo models on Intel's Loihi chip
https://www.nengo.ai/nengo-loihi/

"Number of AxonCfg registers exceeded" Error #326

Open Michaeljurado42 opened 2 years ago

Michaeljurado42 commented 2 years ago

Even though my spiking neural networks work in the simulator, and nengo-loihi checks for axon overflow before uploading the network to Loihi, I still occasionally get the following error:

username@ncl-edu:~neuromorphics$ SLURM=1 PARTITION=loihi_2h python bug_report.py 
34
32
34
33
INFO:DRV:  SLURM is being run in background
INFO:DRV:  Connecting to 10.212.98.108:34499
INFO:DRV:      Host server up..............Done 0.20s
INFO:DRV:      Encoding axons/synapses.....Error 14.35s
INFO:DRV:  SLURM is being run in background
INFO:DRV:  Connecting to 10.212.98.108:32967
INFO:DRV:      Host server up..............Done 0.19s
INFO:DRV:      Encoding axons/synapses.....Error 14.05s
INFO:DRV:  SLURM is being run in background
INFO:DRV:  Connecting to 10.212.98.108:40425
INFO:DRV:      Host server up..............Done 0.18s
INFO:DRV:      Encoding axons/synapses.....Error 14.50s
Traceback (most recent call last):
  File "path/bug_report.py", line 54, in <module>
    with nengo_loihi.Simulator(
  File "path/nengo-loihi/nengo_loihi/simulator.py", line 223, in __enter__
    sim.__enter__()
  File "path/nengo-loihi/nengo_loihi/hardware/interface.py", line 121, in __enter__
    self.connect()
  File "path/nengo-loihi/nengo_loihi/hardware/interface.py", line 190, in connect
    raise SimulationError(
nengo.exceptions.SimulationError: Board connection error: Number of AxonCfg registers exceeded on logical chip 2/physical core 13 (logical core [265]) while creating linkage to logical chip 2/physical core 141 (logical core [281]). Allowed limit is 4096

I see this error with various neural network configurations; the sample below reproduces it:

from tensorflow.keras.layers import *
from nengo_loihi.hardware.allocators import PartitionInterchip
import tensorflow as tf
import nengo_dl
import nengo
import numpy as np
import nengo_loihi

# define a convolutional network
def simple_neural_network():
    inputs = Input(shape=(64, 64, 3))
    spiking_input = Activation(tf.nn.elu)(inputs)  # replaced by SpikingRectifiedLinear below
    conv_out1 = Conv2D(32, (5, 5), strides=(2, 2), padding="valid", activation=tf.nn.relu, use_bias=False)(spiking_input)
    conv_out2 = Conv2D(64, (3, 3), strides=(1, 1), padding="valid", activation=tf.nn.relu, use_bias=False)(conv_out1)
    conv_out3 = Conv2D(128, (2, 2), strides=(2, 2), padding="valid", activation=tf.nn.relu, use_bias=False)(conv_out2)
    conv_out4 = Conv2D(128, (2, 2), strides=(2, 2), padding="valid", activation=tf.nn.relu, use_bias=False)(conv_out3)

    flat_out = Flatten()(conv_out4)
    output = Dense(4, activation=None, name="dense", use_bias=False)(flat_out)

    model = tf.keras.Model(inputs=inputs, outputs=output)
    return model

ann_model = simple_neural_network()
nengo_converter = nengo_dl.Converter(
    ann_model,
    swap_activations={
        tf.nn.relu: nengo_loihi.LoihiSpikingRectifiedLinear(),
        tf.nn.elu: nengo.SpikingRectifiedLinear(),  # spiking input to the SNN
    },
)

# Specify first layer as running off chip
with nengo_converter.net as net:
    nengo_loihi.add_params(net)  # allow on_chip to be set
    net.config[nengo_converter.layers[ann_model.layers[1]].ensemble].on_chip = False

# Define our input to the SNN
nengo_input = nengo_converter.inputs[ann_model.layers[0]]
with nengo_converter.net as net:
    nengo_input.output = nengo.processes.PresentInput(
        np.random.random((12, 64, 64, 3)), presentation_time=.02
    )

# Assign block shapes
block_sizes = [None, (16, 16, 4), (8, 8, 16), (8, 8, 16)]
conv_layers = [layer for layer in ann_model.layers if "conv" in str(layer).lower()]
for layer_idx, (layer, block_size) in enumerate(zip(conv_layers, block_sizes)):
    if block_size is None:
        continue
    output_shape = tuple(layer.output.shape[1:])
    with nengo_converter.net as net:
        net.config[net.ensembles[layer_idx + 1]].block_shape = nengo_loihi.BlockShape(
            block_size, output_shape
        )

# Try to run the network on Loihi
with nengo_loihi.Simulator(
    nengo_converter.net,
    hardware_options=dict(
        n_chips=4, allocator=PartitionInterchip(), snip_max_spikes_per_step=3500
    ),
    precompute=False,
) as sim:
    model_utilization = sim.model.utilization_summary()  # summarize model utilization
    print("\n".join(model_utilization))
    quit()
hunse commented 2 years ago

The short answer is that we're not perfect at estimating the number of axons that NxSDK will use, so sometimes our verification thinks the network is OK, but then you run into an NxSDK error like the one you hit.

That said, when I've encountered this problem before, it's more often been on connections between chips (since those axons can take an additional axon slot to implement). In your case, it appears to be within the same chip, which in my experience is much less common. It's not clear to me why this happens; if someone figures it out, then we can add it to our code so that we'll more accurately measure numbers of axons both for the utilization summary and our validation.
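To illustrate why inter-chip connections matter, here is a back-of-envelope sketch (my own simplification, not nengo-loihi's actual accounting): suppose each output axon uses one AxonCfg slot, and an axon that crosses a chip boundary uses one extra slot, as described above. Then a core that fits comfortably with on-chip targets can overflow once the allocator routes many of its axons to another chip:

```python
AXON_CFG_LIMIT = 4096  # per-core limit reported in the error above


def estimated_slots(n_axons, n_interchip):
    """Estimate AxonCfg slots for a core with n_axons output axons,
    n_interchip of which target a different chip (one extra slot each).
    """
    return n_axons + n_interchip


# A core with 3000 output axons fits if all targets are on-chip...
print(estimated_slots(3000, 0))     # 3000, under the 4096 limit
# ...but overflows once half of them cross to another chip:
print(estimated_slots(3000, 1500))  # 4500, over the 4096 limit
```

This is only meant to show the shape of the problem; the real slot accounting inside NxSDK is more involved.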

To debug, I would try to figure out how many axons we think this core should have, and how many NxSDK thinks it has, and then work out why the two differ. Definitely something worth looking into, and I'll add it to our backlog. If you or anyone else has a chance to look into it beforehand, let us know what you find.
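When comparing counts across many runs, it can help to pull the offending core out of the error text programmatically. A small parser sketch, assuming the message format shown in the traceback above:

```python
import re

# Error text copied from the traceback in this report
msg = (
    "Number of AxonCfg registers exceeded on logical chip 2/physical "
    "core 13 (logical core [265]) while creating linkage to logical "
    "chip 2/physical core 141 (logical core [281]). Allowed limit is 4096"
)

# Extract the source chip/core and the register limit
m = re.search(
    r"chip (\d+)/physical core (\d+) \(logical core \[(\d+)\]\).*"
    r"Allowed limit is (\d+)",
    msg,
)
chip, core, logical, limit = map(int, m.groups())
print(chip, core, logical, limit)  # 2 13 265 4096
```

Knowing which logical core overflowed lets you match it against the per-block rows in the utilization summary.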

As a workaround, you can look at the utilization summary and see which blocks are highest in terms of output axons, since one of those is probably the culprit. (You might need to run with target="sim", or make the nengo_loihi.Simulator without entering it, i.e. without the with, to avoid the error.) Then, you can try to massage the block shapes to reduce this.
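When massaging block shapes, it helps to know how many blocks a given BlockShape produces, since more, smaller blocks mean more inter-block axons. A rough sketch, assuming each layer dimension is tiled into ceil(dim / block_dim) pieces (my reading of how the block_shape settings in the script above split an ensemble):

```python
from math import ceil, prod


def n_blocks(layer_shape, block_shape):
    """Number of blocks a layer of layer_shape is split into when
    tiled by block_shape (assumed ceil division per dimension)."""
    return prod(ceil(d / b) for d, b in zip(layer_shape, block_shape))


# e.g. the first conv output in the script above is (30, 30, 32)
# (64x64 input, 5x5 kernel, stride 2, valid padding), tiled by (16, 16, 4):
print(n_blocks((30, 30, 32), (16, 16, 4)))  # 2 * 2 * 8 = 32 blocks
```

Trying a few candidate shapes this way shows the trade-off: larger blocks mean fewer blocks (and fewer axons between them) but more compartments per core, so the shapes have to be tuned against both limits.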