nengo / nengo-loihi

Run Nengo models on Intel's Loihi chip
https://www.nengo.ai/nengo-loihi/

Memory accumulation after multiple sim.close() #311

Closed nmnaughton closed 2 years ago

nmnaughton commented 3 years ago

Hello.

I am using the emulator in nengo_loihi v1.0.0 and am trying to reset and rerun an ensemble multiple times. When I do this, the memory used by the process quickly accumulates. If I run the same code using core Nengo, I do not get this memory accumulation. It seems that sim.close() may not release all the memory associated with the network? Could it be a probe issue? Below is an MWE. I am using Python 3.7.7 on macOS 11.1. Any help diagnosing the problem would be appreciated. Thanks!


import nengo
import nengo_loihi
import numpy as np


class NewClass:
    def __init__(self):
        self.input_size = 10
        self.n_neurons = 500
        self.initialize_nengo()

    def initialize_nengo(self):
        network = nengo.Network()
        with network:

            def input_func(t):
                return np.ones(self.input_size)

            def output_func(t, x):
                self.output = x

            input_layer = nengo.Node(output=input_func, size_in=0, size_out=self.input_size)
            ensemble = nengo.Ensemble(n_neurons=self.n_neurons, dimensions=1)
            output_layer = nengo.Node(output=output_func, size_in=self.n_neurons, size_out=0)

            conn_in = nengo.Connection(
                input_layer, ensemble.neurons, transform=np.ones((self.n_neurons, self.input_size))
            )
            conn_out = nengo.Connection(ensemble.neurons, output_layer)

        self.sim = nengo_loihi.Simulator(network, precompute=True)
        # self.sim = nengo.Simulator(network, progress_bar=False)  # no leak with core Nengo

    def run(self, steps, num_resets):
        for i in range(num_resets):
            self.sim.run_steps(steps)
            self.reset_nengo()
            print("finished iteration:", i)

    def reset_nengo(self):
        self.sim.close()
        self.initialize_nengo()


steps = 1000
num_resets = 500

nengo_class = NewClass()
nengo_class.run(steps, num_resets)
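For anyone reproducing this: one way to quantify the per-iteration growth is to print the process's resident set size inside the loop. This is a minimal sketch, assuming the third-party psutil package is installed; it is not part of the original report.

    import psutil

    process = psutil.Process()  # the current Python process

    def log_rss(tag):
        # Resident set size in MiB; this should climb each iteration if the leak is present
        rss_mib = process.memory_info().rss / 2**20
        print(f"{tag}: RSS = {rss_mib:.1f} MiB")

Calling log_rss(f"iteration {i}") right after the print in run() makes the accumulation visible without a profiler.
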
nmnaughton commented 3 years ago

I've confirmed this issue also occurs on Windows. Anecdotally, I have also noticed that when running on Loihi hardware, the simulation time on the chip, as well as the time to transfer data from chip to host, increases the more times I reset and rerun, which may or may not be the same issue.

xchoo commented 3 years ago

I can confirm this to be the case. It seems related to the amount of data being transferred on and off the chip (even when it's just the emulator running). Removing the connection to the output layer reduces the amount of memory leaked, but it still leaks. Changing the number of neurons (i.e., increasing the dimensionality of the chip output) or increasing the number of timesteps increases the amount of memory leaked.

Converting the code to use a Python with context block didn't affect the amount of leakage either. The memory leak doesn't happen when using the core Nengo simulator, so it seems to be NengoLoihi-specific.
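
For reference, the context-manager variant mentioned above looks roughly like this (a sketch of what was tried, not the exact test code; build_network is a hypothetical helper that rebuilds the MWE network):

    # Rebuild and run inside a `with` block, so Simulator.__exit__
    # calls close() automatically. The leak persisted regardless.
    for i in range(num_resets):
        network = build_network()  # hypothetical helper
        with nengo_loihi.Simulator(network, precompute=True) as sim:
            sim.run_steps(steps)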

nmnaughton commented 3 years ago

I dug into this a bit more. As xchoo said, increasing the number of timesteps increases the leakage; however, I realized that you get leakage even if you never call run_steps(). So it appears that just creating the network multiple times leads to the leakage. Following this, I dug through the initializations and found that calling connection.build_host_to_chip and connection.build_chip_to_host leads to memory accumulation (also in keeping with what xchoo mentioned).

I focused on build_host_to_chip and found that in connection.build_full_chip_connection, the lines starting at 669 caused the issue.

    ax = Axon(mid_obj.n_neurons, label="neuron_weights")
    ax.target = syn  # strong reference from the axon to its target synapse
    ax.set_compartment_axon_map(target_axons)
    mid_obj.add_axon(ax)

So it seems that, at least when connecting to the board (and I would assume the reverse direction holds as well), the axon/synapse that connects to the board is what is not being released from memory. Editing reset_nengo() to be

    def reset_nengo(self):
        # Workaround: manually drop the axon references before closing,
        # since sim.close() alone does not release them.
        for obj in self.sim.model.objs.values():
            for key in ("in", "out"):
                try:
                    obj[key].axons = None
                except (KeyError, AttributeError):
                    pass

        self.sim.close()
        self.initialize_nengo()

then helped eliminate most (but not all) of the memory accumulation when I ran the MWE without calling run_steps. If I add calls to run_steps back in, there continues to be substantial memory leakage (though not quite as much).

Interestingly, if you try self.sim.model.objs[items]["in"] = None, the problem persists. So it seems that some objects, such as self.sim.model.objs[items]["in"].axons, are referenced elsewhere and so do not get released unless explicitly removed. I played around with explicitly deleting various parts of sim.model but could not find a solution. Hopefully this helps someone more knowledgeable of nengo_loihi's data structures.
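
One way to chase down what else holds those references is the standard-library gc module. This is a hypothetical diagnostic, not something from the original thread; it assumes sim and the "in" entries exist as in the MWE above.

    import gc

    # Pick an object that should be freed after sim.close() but isn't,
    # and list everything that still refers to it.
    obj = next(iter(sim.model.objs.values()))["in"].axons
    for referrer in gc.get_referrers(obj):
        print(type(referrer), str(referrer)[:80])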

kshivvy commented 3 years ago

@xchoo We were wondering if you or other members of the Nengo team had any solutions for this memory leak? Our training method requires us to run and reset an ensemble multiple times, but the memory usage makes our method infeasible.

Thanks for your help!

hunse commented 3 years ago

I've been able to solve the memory leak that @nmnaughton found. The fact that you had traced it to a problematic line was very helpful. The fix is in #312.

However, there's still a memory leak when run_steps is called, and I think it's much bigger. See my test script in #312. I don't have time to look into it more right now, but if any of you have time to track it down, that would certainly help to fix it more quickly.

hunse commented 3 years ago

Ok, I had an idea, so I was able to quickly fix another one.

My guess is that there's also one in nengo_loihi/probe.py, with Probe.target, similar to the one with Axon.target. So switching over to weakrefs there might also help, but it's a bit more complicated because we also touch that in nengo_loihi/builder/split_blocks.py.

EDIT: I tried using weakref for Probe.target, but it didn't help memory noticeably. So more work is needed to track down the leak when running.
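
For readers unfamiliar with the pattern, this is the general shape of the weakref fix described above (a minimal sketch of the technique, not NengoLoihi's actual Axon implementation):

    import weakref

    class Axon:
        """Sketch: store the target as a weak reference so the axon
        does not keep its target synapse alive after teardown."""

        def __init__(self, n_neurons, label=None):
            self.n_neurons = n_neurons
            self.label = label
            self._target_ref = None

        @property
        def target(self):
            # Returns None once the target has been garbage collected
            return None if self._target_ref is None else self._target_ref()

        @target.setter
        def target(self, obj):
            self._target_ref = None if obj is None else weakref.ref(obj)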

hunse commented 3 years ago

I tracked down the last one with the help of tracemalloc. It's now fixed in #312. (There are still some leaks, but they appear to be very minor now.)
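
For anyone debugging similar leaks, the tracemalloc workflow looks roughly like this (a generic standard-library sketch, not the exact diagnosis script used here):

    import tracemalloc

    tracemalloc.start()
    before = tracemalloc.take_snapshot()

    # ... build, run, and close the simulator a few times ...

    after = tracemalloc.take_snapshot()
    # Show the source lines whose allocations grew the most
    for stat in after.compare_to(before, "lineno")[:10]:
        print(stat)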

kshivvy commented 3 years ago

> I tracked down the last one with the help of tracemalloc. It's now fixed in #312. (There are still some leaks, but they appear to be very minor now.)

@hunse That remaining memory leak seems to impact performance in our use case when steps=2000 and a more complicated network is used. Is it possible to resolve that remaining leak as well?

hunse commented 3 years ago

Can you provide an example that reproduces the memory leak? Looking into it is on our backlog, but since NengoLoihi is currently a free package without external funding, we don't have much time to work on it.

hunse commented 2 years ago

Hi @kshivvy. Not sure if you're still interested, but I think I've found and fixed the final leak now in the fix-memory-leak branch. We've got a push on right now to do some NengoLoihi work, so it should be merged into master in the next week or two.

If you find any more memory leaks, feel free to post an example network/simulator so that I can reproduce them.

kshivvy commented 2 years ago

Hi @hunse, thanks for finding the final memory leak! It will definitely help the use case @nmnaughton and I have. We'll keep an eye out for when the fix is merged.

hunse commented 2 years ago

The fix is now merged, so you can try it out on the master branch. If you have any more issues (memory leaks or otherwise), feel free to open a new issue.