**nmnaughton** opened this issue 2 years ago
I was able to track down the source of the memory usage. There were actually two parts, both related to passing spikes to and from the chip. In both cases, the full history of the spikes to and from the chip is saved: as a dict for passing to the chip, and as an array for passing back to the host. As a quick hack, editing `simulator.py` to clear the relevant entries after each `run_steps` makes the memory usage constant (and ~10x smaller). However, this is probably not the correct final solution, and it will also change the code's behavior.
```python
def emu_precomputed_host_pre_and_host(self, steps):
    self.timers.start("run")
    self.host_pre.run_steps(steps)
    self._host2chip(self.emulator)
    self.emulator.run_steps(steps)
    self._chip2host(self.emulator)
    self.host.run_steps(steps)

    # Hack: clear the accumulated spike history on both sides
    # so memory usage stays constant across run_steps calls
    for probe, receiver in self.probes_receivers.items():
        receiver.queue.clear()
        receiver.queue_index = 0
    for sender, receiver in self.model.host2chip_senders.items():
        self.model.spike_targets[receiver].spikes = {}

    self.timers.stop("run")
```
EDIT: Just a note: if you do not use `precompute=True`, there are still large memory leaks, but that seems like a more niche case. I only realized it due to a typo.
Thanks for tracking those down, @nmnaughton. I was aware of the one with the receiver queue but hadn't gotten around to fixing it yet; I wasn't aware of the one with the spike targets, and I think that's the more significant one.

I've addressed them both in #324. Note that the fix you've got above no longer works with the `master` branch, because we've changed how we manage sender and receiver queues as part of adding the `clear_probes` function to `Simulator`. Also, you might want to check out that `clear_probes` function, as probes recording over long periods can also use quite a bit of memory.
At some point, I'll have to look more into the actual memory leaks. One thing I'll note about your script is that while your `Simulator` object gets closed, it doesn't get deleted until the `sim` variable goes out of scope, which is after the `gc.collect()` call. A closed `Simulator` still maintains all its probe data; this is intentional, so that you can close a simulator (and disconnect from the Loihi board) but still use the data you've collected. It should use about the same amount of memory whether it's run all at once or with multiple `run_steps` calls, though, so if there are differences there, that's definitely indicative of something. We'll want to re-run this on the current `master` branch, though, because we've made a number of changes to how data is probed and stored as part of fixing the earlier memory leaks and adding the `clear_probes` function. So some of these issues may be resolved.
I tried measuring things with a modification of your script (below), which uses heapy to measure the memory usage. I'm not seeing any significant memory leaks (the difference between start and end is on the order of a few hundred kB), nor anything that depends on either the number of times `run_steps` is called or the total number of steps. This is the case even when I test on NengoLoihi v1.0.0; I don't see any difference between case 1 and case 4. This leads me to question how much `top` can be trusted for these measurements; i.e., when a process frees up memory, I'm not sure whether this is reflected in the memory usage reported by `top`.
Of course, the peak memory usage is certainly dependent on the number of steps. Hopefully, with the changes in #324, that is much less significant.
```python
import gc

import nengo
import numpy as np

import nengo_loihi

use_tracemalloc = False
# use_tracemalloc = True

if use_tracemalloc:
    import tracemalloc

    tracemalloc.start()
    tracemalloc.start(25)
else:
    from guppy import hpy

    h = hpy()


def snapshot():
    if use_tracemalloc:
        return tracemalloc.take_snapshot()
    else:
        return h.heap()


def print_snapshot(snap):
    if use_tracemalloc:
        if (
            isinstance(snap, list)
            and len(snap) > 0
            and isinstance(snap[0], tracemalloc.StatisticDiff)
        ):
            for stat in snap[:20]:
                print(stat)
                # print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
                for i, line in enumerate(stat.traceback.format()):
                    print(line)
                    if i > 5:
                        break
        else:
            print(snap)
    else:
        print(snap)


def snapshot_diff(snap0, snap1):
    if use_tracemalloc:
        return snap1.compare_to(snap0, "traceback")
    else:
        return snap1 - snap0


class NewClass:
    def __init__(self):
        self.input_size = 10
        self.n_neurons = 1024
        self.initialize_nengo()

    def initialize_nengo(self):
        network = nengo.Network()
        with network:

            def input_func(t):
                return np.ones(self.input_size)

            def output_func(t, x):
                self.output = x

            input_layer = nengo.Node(
                output=input_func, size_in=0, size_out=self.input_size
            )
            ensemble = nengo.Ensemble(
                n_neurons=self.n_neurons,
                dimensions=1,
            )
            output_layer = nengo.Node(
                output=output_func, size_in=self.n_neurons, size_out=0
            )
            conn_in = nengo.Connection(
                input_layer,
                ensemble.neurons,
                transform=np.ones((self.n_neurons, self.input_size)),
            )
            conn_out = nengo.Connection(ensemble.neurons, output_layer, synapse=None)
        self.network = network

    def run(self, case, num_resets):
        for i in range(num_resets):
            with nengo_loihi.Simulator(self.network, precompute=False) as sim:
                if case == 1:
                    for _ in range(1000):
                        sim.run_steps(10)
                elif case == 2:
                    for _ in range(10):
                        sim.run_steps(1000)
                elif case == 3:
                    for _ in range(2):
                        sim.run_steps(5000)
                elif case == 4:
                    for _ in range(1):
                        sim.run_steps(10000)
            del sim
            print("finished iter", i + 1)
            gc.collect()


nengo_class = NewClass()

snap0 = snapshot()
# nengo_class.run(case=1, num_resets=1)
nengo_class.run(case=4, num_resets=1)
snap1 = snapshot()

snapd = snapshot_diff(snap0, snap1)
print_snapshot(snapd)
```
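As a side note, the peak resident size that `top` is roughly tracking can also be read from within the process via the stdlib `resource` module (Unix only). A minimal sketch; note the unit of `ru_maxrss` differs by platform:

```python
import resource
import sys

usage = resource.getrusage(resource.RUSAGE_SELF)
# ru_maxrss is the peak resident set size of the process so far:
# kilobytes on Linux, bytes on macOS.
scale = 1 if sys.platform == "darwin" else 1024
peak_bytes = usage.ru_maxrss * scale
print(f"peak RSS: {peak_bytes / 1e6:.1f} MB")
```

Unlike heapy or tracemalloc, this reflects memory the OS has actually committed to the process, so it never decreases even after Python frees objects.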
> This leads me to question how much `top` can be trusted for these measurements; i.e., when a process frees up memory, I'm not sure whether this is reflected in the memory usage reported by `top`.
Processes in general can free up memory, but Python never releases memory back to the operating system. Python manages its own memory heap, so if you continue to allocate memory, it'll keep asking the OS to give it more memory. If you stop needing memory and garbage collect it in Python, that memory will now be free for the Python process to use in the future. It never gets released back to the OS, so the memory used by a Python process is generally a bit above the maximum amount of memory that Python process has ever needed.
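This distinction between Python-level allocations and OS-reported memory can be illustrated with the stdlib `tracemalloc` module, which tracks what Python has allocated regardless of what the OS thinks the process is holding. A minimal sketch (exact sizes will vary):

```python
import tracemalloc

tracemalloc.start()

big = [0] * 1_000_000  # allocate several MB at the Python level
with_big, _peak = tracemalloc.get_traced_memory()

del big  # freed for reuse within the Python process...
after, _peak = tracemalloc.get_traced_memory()

# ...so tracemalloc sees the drop, even though the resident size
# reported by top may not shrink correspondingly.
print(with_big, after)
```

This is why heapy/tracemalloc measurements in the script above can show no leak while `top` still reports high resident memory.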
Thanks @tbekolay, that's good to know.
There is still the question of why @nmnaughton sees higher memory usage in `top` in case 1 versus case 4. In terms of things like the spike queues filling up, these two cases should be the same. Maybe there's something else that causes the case with multiple `run_steps` calls to use higher peak memory?

Anyway, hopefully just reducing the amount of memory used during a run (through the fixes in #324) is good enough for now.
So I think overall I actually saw the same memory usage between the different cases. The difference was that in case 1, the memory was accumulated and maintained by the Python process, while in case 4 there was a quick spike in memory when the spike queue was dotted with the weights, but then that memory appeared to be released back to the system according to `top` (not sure how that squares with @tbekolay's point, though).

This was an issue because I was running on a cluster with multiple simulations, so if each simulation was always trying to reserve this "peak" memory usage, the combined memory usage would cause SLURM to throw an out-of-memory error. If instead each simulation only had a short spike in memory usage at the end of each run, that was not as much of an issue, since the simulations did not all end at the same time.

Either way, it looks like #324 addresses the issue. Thanks for your help on this!
Hi, following up on #311. I tried using the new code on the master branch and was still having issues. Playing around with the test script from #312, I think the issue is that when `sim.run_steps()` is called many times in a row, a memory leak occurs. I am defining a memory leak here to be the resident memory associated with the process as seen in `top`. In the code below, when the for loop is run many times in a row, after each `.run_steps()` a section of memory is taken up proportional to the number of time steps just run. This stacks up throughout the run and can add up to a lot of memory. If `.run_steps()` is only called a few times, then this memory gets released when `gc.collect()` is called. If `.run_steps()` is called many times, then this memory does not get released correctly and stays reserved (case 1 vs. case 2 below).

So I think that the failure to release the memory is likely a bug, but is it also possible to not have that memory reserved at all? In my use case, I am trying to run a reservoir for a long time, and even if the memory is correctly released at the end of the simulation, it currently takes up ~20 GB of memory due to the length of the simulation, and I cannot scale the network any further. I have no probes in my simulation and do not need any past information about the reservoir, but currently the memory requirement is proportional to how many time steps I have taken, which seems like unwanted behavior.

Thanks!