Closed — nmnaughton closed this issue 2 years ago
I've confirmed this issue also occurs on Windows. Anecdotally, when running on Loihi hardware, both the on-chip simulation time and the chip-to-host data transfer time increase the more times I reset and rerun, which may or may not be the same issue.
I can confirm this to be the case. It seems related to the amount of data being transferred on and off the chip (even if it's just the emulator running). Removing the connection to the output layer reduces the memory leak, but it still leaks. Increasing the number of neurons (i.e., increasing the dimensionality of the chip output) or increasing the number of timesteps increases the amount of leakage.
Converting the code to use a Python `with` context block didn't affect the amount of leakage either.
The memory leak doesn't happen when using the core Nengo simulator, so it seems to be NengoLoihi-specific.
I dug into this a bit more. As xchoo said, increasing the number of timesteps increases the leakage. However, I realized that you get leakage even if you do not call `run_steps()`, so it appears that just creating the network multiple times leads to the leakage. Following this, I dug through the initializations and found that calling `connection.build_host_to_chip` and `connection.build_chip_to_host` leads to memory accumulation (also in keeping with what xchoo mentioned).
I focused on `build_host_to_chip` and found that in `connection.build_full_chip_connection` the lines starting at 669 caused the issue:
```python
ax = Axon(mid_obj.n_neurons, label="neuron_weights")
ax.target = syn
ax.set_compartment_axon_map(target_axons)
mid_obj.add_axon(ax)
```
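To illustrate why lines like these can accumulate memory across repeated builds, here is a minimal, purely illustrative sketch. The class names (`Axon`, `Compartment`) and the bytearray stand-in for synapse weights are made up for this example, not the real nengo_loihi classes:

```python
# Hypothetical sketch: repeated builds attach new axons to a long-lived
# object, so earlier builds' data is never released.
class Axon:
    def __init__(self, target):
        self.target = target  # strong reference keeps `target` alive


class Compartment:
    def __init__(self):
        self.axons = []

    def add_axon(self, ax):
        self.axons.append(ax)


mid_obj = Compartment()  # survives across resets, like a builder-held object


def build(mid_obj):
    syn = bytearray(10_000_000)  # stand-in for synapse weight memory
    mid_obj.add_axon(Axon(syn))


for _ in range(3):
    build(mid_obj)  # each "reset and rerun" adds another build's worth

print(len(mid_obj.axons))  # -> 3: all three builds' synapses are still alive
```

As long as `mid_obj` is reachable, every axon ever added to it (and every synapse those axons point at) stays in memory.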
So it seems that, at least when connecting to the board (and I would assume the reverse holds as well), the axon/synapse that connects to the board is what is not being released from memory. Editing `self.reset()` to be:
```python
def reset_nengo(self):
    for items in self.sim.model.objs:
        try:
            self.sim.model.objs[items]["in"].axons = None
        except (KeyError, AttributeError):
            pass
        try:
            self.sim.model.objs[items]["out"].axons = None
        except (KeyError, AttributeError):
            pass
    self.sim.close()
    self.initialize_nengo()
```
then helped eliminate most (but not all) of the memory accumulation when I ran the MWE without calling `run_steps`. If I add in calls to `run_steps`, there continues to be substantial memory leakage (though not quite as much).
Interestingly, if you try `self.sim.model.objs[items]["in"] = None`, the problem persists. So it seems that some memory allocations, such as `self.sim.model.objs[items]["in"].axons`, are referenced elsewhere and so do not get released unless explicitly removed. I played around with explicitly deleting various parts of `sim.model` but could not find a solution. Hopefully this helps someone more knowledgeable of nengo_loihi's data structures.
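This behavior follows from Python's reference counting: replacing one reference to an object does nothing if another reference exists elsewhere, whereas clearing the attribute on the shared object drops the data for everyone. A minimal, nengo-free sketch (all names here are stand-ins for the real model structures):

```python
import gc
import weakref

class Obj:
    pass

holder = Obj()        # stands in for model.objs[ens]["in"]
holder.axons = Obj()  # stands in for the leaked axon data
elsewhere = holder    # a second reference, e.g. held by the builder

wr = weakref.ref(holder.axons)  # lets us observe when the data is freed
objs = {"ens": holder}

objs["ens"] = None    # mirrors objs[items]["in"] = None
gc.collect()
print(wr() is None)   # False: `elsewhere.axons` still holds the data

elsewhere.axons = None  # mirrors objs[items]["in"].axons = None
gc.collect()
print(wr() is None)     # True: the last strong reference is gone
```

This is why nulling the dict entry didn't help, while nulling `.axons` on the object itself did.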
@xchoo We were wondering if you or other members of the Nengo team had any solutions for this memory leak? Our training method requires us to run and reset an ensemble multiple times, but the memory usage makes our method infeasible.
Thanks for your help!
I've been able to solve the memory leak that @nmnaughton found. The fact that you had traced it to a problematic line was very helpful. The fix is in #312.
However, there's still a memory leak when `run_steps` is called, and I think it's much bigger. See my test script in #312. I don't have time to look at it more right now, but if any of you have time to track it down, that would certainly help to fix it more quickly.
Ok, I had an idea, so I was able to quickly fix another one.
My guess is that there's also one in nengo_loihi/probe.py, with `Probe.target`, similar to the one with `Axon.target`. So switching over to weakrefs there might also help, but it's a bit more complicated because we also touch that in nengo_loihi/builder/split_blocks.py.
EDIT: I tried using a weakref for `Probe.target`, but it didn't noticeably help memory. So more work is needed to track down the leak when running.
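For readers unfamiliar with the weakref approach being discussed: storing the target behind a `weakref.ref` means the probe no longer keeps its target alive. A minimal sketch, with hypothetical `Probe`/`Synapse` classes (not the actual #312 code):

```python
import gc
import weakref

class Synapse:
    pass

class Probe:
    def __init__(self, target):
        # Store only a weak reference so the probe does not keep
        # its target alive across simulator resets.
        self._target = weakref.ref(target)

    @property
    def target(self):
        return self._target()  # None once the target has been collected

syn = Synapse()
p = Probe(syn)
print(p.target is syn)  # True: the target is still alive

del syn
gc.collect()
print(p.target)  # None: the synapse was freed despite the probe
```

The complication mentioned above is that any other code touching `target` (e.g. during block splitting) must then handle the dereference and the possibility of `None`.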
I tracked down the last one with the help of `tracemalloc`. It's now fixed in #312. (There are still some leaks, but they appear to be very minor now.)
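For anyone wanting to reproduce this kind of hunt, the standard-library `tracemalloc` workflow is to take snapshots before and after the suspect operation and diff them; the lines whose allocations grew point at the leak. A self-contained sketch with a simulated leak (the `build_step` function is made up for illustration):

```python
import tracemalloc

leak = []

def build_step():
    # Simulated per-run allocation that is never released.
    leak.append(bytearray(1_000_000))

tracemalloc.start()
build_step()
snap1 = tracemalloc.take_snapshot()
build_step()
build_step()
snap2 = tracemalloc.take_snapshot()

# Statistics are sorted with the biggest growth first; the top entries
# name the file and line where the leaked memory was allocated.
for stat in snap2.compare_to(snap1, "lineno")[:3]:
    print(stat)
```

Running this points straight at the `bytearray` line, which is exactly how a leak inside a builder call can be localized.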
@hunse That remaining memory leak seems to impact performance in our use case when `steps=2000` and a more complicated network is used. Is it possible to resolve that remaining memory leak as well?
Can you provide an example that reproduces the memory leak? Looking into it is on our backlog, but since NengoLoihi is a free package currently without external funding, we don't have much time to work on it.
Hi @kshivvy. Not sure if you're still interested, but I think I've found and fixed the final leak, now in the `fix-memory-leak` branch. We've got a push on right now to do some NengoLoihi work, so it should be merged into `master` in the next week or two.
If you find any more memory leaks, feel free to post an example network/simulator so that I can reproduce them.
Hi @hunse, thanks for finding the final memory leak! It will definitely help the use case @nmnaughton and I have. We'll keep an eye out for when the fix is merged.
The fix is now merged, so you can try it out on the `master` branch. If you have any more issues (memory leaks or otherwise), feel free to open a new issue.
Hello. I am using the emulator in nengo_loihi v1.0.0 and am trying to reset and rerun an ensemble multiple times. When I do this, I notice that the memory used by the process quickly accumulates. If I run the same code using regular Nengo, I do not get this memory accumulation. It seems that maybe `sim.close()` does not release all the memory associated with the network? Could it be a probe issue? Below is a MWE. I am using Python 3.7.7 on macOS 11.1. Any help diagnosing the problem would be appreciated. Thanks!
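The MWE itself was not captured in this excerpt. Purely as an illustration of the reset-and-rerun pattern being described (this is not the original code, has no nengo_loihi dependency, and every name in it is made up), here is a sketch of how per-iteration memory growth across `close()` calls can be detected with `tracemalloc`:

```python
import tracemalloc

# Stand-in for a simulator whose close() does incomplete cleanup:
# a hidden class-level registry keeps every instance (and its data) alive.
class FakeSimulator:
    registry = []

    def __init__(self):
        self.data = bytearray(500_000)  # stand-in for per-build allocations
        FakeSimulator.registry.append(self)  # the hidden reference

    def run_steps(self, n):
        pass

    def close(self):
        self.closed = True  # does NOT drop the registry reference

tracemalloc.start()
sizes = []
for _ in range(5):
    sim = FakeSimulator()
    sim.run_steps(100)
    sim.close()
    sizes.append(tracemalloc.get_traced_memory()[0])

# If close() released everything, traced memory would plateau;
# here it grows by ~500 KB every iteration instead.
print(all(b > a for a, b in zip(sizes, sizes[1:])))  # True: memory grows
```

A real reproduction would replace `FakeSimulator` with `nengo_loihi.Simulator` around an actual network, but the measurement loop is the same.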