nengo / nengo-loihi

Run Nengo models on Intel's Loihi chip
https://www.nengo.ai/nengo-loihi/

Possible memory leak for multiple run_steps() #323

Open nmnaughton opened 2 years ago

nmnaughton commented 2 years ago

Hi, following up on #311. I tried using the new code on the master branch and was still having issues. Playing around with the test script from #312, I think the issue is that when sim.run_steps() is called many times in a row, a memory leak occurs. By "memory leak" here I mean growth in the resident memory associated with the process, as seen in top. In the code below, when the for loop calls .run_steps() many times in a row, each call takes up a chunk of memory proportional to the number of time steps just run. This stacks up throughout the run and can add up to a lot of memory. If .run_steps() is only called a few times, this memory gets released when gc.collect() is called. If .run_steps() is called many times, this memory does not get released correctly and stays reserved (Case 1 vs. Case 2 below).

So I think the failure to release this memory is likely a bug, but is it also possible to not have that memory reserved at all? In my use case, I am trying to run a reservoir for a long time, and even if the memory is correctly released at the end of the simulation, it currently takes up ~20 GB due to the length of the simulation, so I cannot scale the network any further. I have no probes in my simulation and do not need any past information about the reservoir, but currently the memory requirement is proportional to how many time steps I have taken, which seems like unwanted behavior.

Thanks!

import gc
import weakref

import nengo
import numpy as np

import nengo_loihi

class NewClass:
    def __init__(self):
        self.input_size = 10
        self.n_neurons = 1024
        self.initialize_nengo()

    def initialize_nengo(self):
        network = nengo.Network()
        with network:

            def input_func(t):
                return np.ones(self.input_size)

            def output_func(t, x):
                self.output = x

            input_layer = nengo.Node(
                output=input_func, size_in=0, size_out=self.input_size
            )

            ensemble = nengo.Ensemble(
                n_neurons=self.n_neurons, 
                dimensions=1, 
            )

            output_layer = nengo.Node(
                output=output_func, size_in=self.n_neurons, size_out=0
            )

            conn_in = nengo.Connection(
                input_layer,
                ensemble.neurons,
                transform=np.ones((self.n_neurons, self.input_size)),
            )
            conn_out = nengo.Connection(ensemble.neurons, output_layer, synapse=None)

        self.network = network

    def run(self, case, num_resets):
        for i in range(num_resets):
            with nengo_loihi.Simulator(self.network, precompute=True) as sim:
                if case == 1:
                    """ Memory increases throughout the run and at the end the total memory that has been
                    reserved up to this point is maintained for future simulations and not released. 
                    The amount of memory is not increased if future simulations have the same number
                    of run_steps. If they have more, then the memory starts to increase again once
                    more run_steps have been called. """
                    for _ in range(10000):
                    # for _ in range(10000 * (i+1)):  #to demonstrate what happens for increasing run_steps.
                        sim.run_steps(10)
                elif case == 2:
                    """ Memory will increase like in case 1, but at the end of the simulation the memory 
                    is correctly released and starts accumulated over again. """
                    for _ in range(10):
                        sim.run_steps(10000)
                elif case == 3:
                    """ Demonstration of the memory increasing in jumps after each run_steps"""
                    for _ in range(2):
                        sim.run_steps(50000)
                        input('run_step done, Press Enter')
                elif case == 4:
                    """ Baseline case. Memory usage is consistently low except for a brief spike 
                    at the very end of the run_step when it spikes but then drops back down 
                    because gc.collect is called and correctly releases the memory."""
                    for _ in range(1):
                        sim.run_steps(100000)
                        input('run_step done, Press Enter')

            print('finished iter', i+1)
            gc.collect()
            # RAM will accumulate until gc.collect() is called manually.
            # It has to be called outside of the 'with' block to release the memory.

num_resets = 3
nengo_class = NewClass()
case = 4
nengo_class.run(case, num_resets)
nmnaughton commented 2 years ago

I was able to track down the source of the memory usage. There were actually two parts, both related to passing spikes to and from the chip. Essentially, in both cases the full history of spikes to and from the chip is saved: as a dict for passing to the chip, and as an array for passing back to the host. As a quick hack, editing simulator.py to clear the relevant entries after each run_steps call keeps the memory usage constant (and ~10x smaller); however, this is probably not the correct final solution, and it also changes the code behavior.

    def emu_precomputed_host_pre_and_host(self, steps):
        self.timers.start("run")
        self.host_pre.run_steps(steps)
        self._host2chip(self.emulator)
        self.emulator.run_steps(steps)
        self._chip2host(self.emulator)
        self.host.run_steps(steps)

        # quick hack: clear the accumulated chip-to-host probe/spike queues
        for probe, receiver in self.probes_receivers.items():
            receiver.queue.clear()
            receiver.queue_index = 0

        # quick hack: clear the accumulated host-to-chip spike history
        for sender, receiver in self.model.host2chip_senders.items():
            self.model.spike_targets[receiver].spikes = {}

        self.timers.stop("run")

EDIT: Just a note, if you do not use precompute=True then there are still large memory leaks, but that seems like a more niche case. I only realized it due to a typo.

hunse commented 2 years ago

Thanks for tracking those down, @nmnaughton. I was aware of the one with the receiver queue but hadn't gotten around to fixing it yet; I wasn't aware of the one with the spike targets, and I think that's the more significant one.

I've addressed them both in #324. Note that the fix you've got above no longer works with the master branch, because we've changed how we manage sender and receiver queues as part of adding the clear_probes function to Simulator. Also, you might want to check out that clear_probes function, as probes recording over long periods can also use quite a bit of memory.
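As a rough sketch of how periodic probe clearing could look (the network, probe, and chunk sizes here are just placeholders, and it assumes clear_probes() simply discards whatever probe data has been recorded so far):

import numpy as np

import nengo
import nengo_loihi

with nengo.Network() as net:
    stim = nengo.Node(np.ones(1))
    ens = nengo.Ensemble(100, 1)
    nengo.Connection(stim, ens)
    probe = nengo.Probe(ens, synapse=0.01)

chunks = []
with nengo_loihi.Simulator(net, precompute=True) as sim:
    for _ in range(10):
        sim.run_steps(1000)
        # copy out whatever is still needed from this chunk of the run ...
        chunks.append(np.array(sim.data[probe]))
        # ... then drop the stored probe data so it does not keep accumulating
        sim.clear_probes()

data = np.concatenate(chunks)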

hunse commented 2 years ago

At some point, I'll have to look more into the actual memory leaks. One thing I'll note about your script is that while your Simulator object gets closed, it doesn't get deleted until the sim variable goes out of scope, which is after the gc.collect() call. A closed Simulator still maintains all its probe data; this is intentional so that you can close a simulator (and disconnect from the Loihi board), but still use the data you've collected. It should use about the same amount of memory whether it's run all at once or with multiple run_steps calls, though, so if there are differences there, that's definitely indicative of something. We'll want to re-run this on the current master branch, though, because we've made a number of changes to how data is probed and stored as part of fixing the earlier memory leaks and adding the clear_probes function. So some of these issues may be resolved.
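As a small illustration of that scoping point, reusing the NewClass network from the script above (the step counts are placeholders):

import gc

import nengo_loihi

nengo_class = NewClass()  # NewClass as defined in the original script

for i in range(3):
    with nengo_loihi.Simulator(nengo_class.network, precompute=True) as sim:
        sim.run_steps(10000)
    # the simulator is closed here, but `sim` still references it (and its
    # probe data); drop the reference before collecting so that memory can
    # actually be reclaimed
    del sim
    gc.collect()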

hunse commented 2 years ago

I tried measuring things with a modification of your script (below), which uses heapy to measure the memory usage. I'm not seeing any significant memory leaks (the difference between start and end is on the order of a few hundred kB), and not anything that depends on either the number of times run_steps is called or on the total number of steps. This is the case even when I test on NengoLoihi v1.0.0; I don't see any difference between case 1 and case 4. This leads me to question how much top can be trusted for these measurements; i.e. when a process frees up memory, I'm not sure if this is reflected in the memory usage reported by top.

Of course, the peak memory usage is certainly dependent on the number of steps. Hopefully, with the changes in #324, that is much less significant.

import gc

import nengo
import numpy as np

import nengo_loihi

use_tracemalloc = False
# use_tracemalloc = True

if use_tracemalloc:
    import tracemalloc

    tracemalloc.start(25)  # trace allocations, keeping up to 25 frames per traceback

else:
    from guppy import hpy

    h = hpy()

def snapshot():
    if use_tracemalloc:
        return tracemalloc.take_snapshot()
    else:
        return h.heap()

def print_snapshot(snap):
    if use_tracemalloc:
        if (
            isinstance(snap, list)
            and len(snap) > 0
            and isinstance(snap[0], tracemalloc.StatisticDiff)
        ):
            for stat in snap[:20]:
                print(stat)
                # print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
                for i, line in enumerate(stat.traceback.format()):
                    print(line)
                    if i > 5:
                        break

        else:
            print(snap)
    else:
        print(snap)

def snapshot_diff(snap0, snap1):
    if use_tracemalloc:
        return snap1.compare_to(snap0, "traceback")
    else:
        return snap1 - snap0

class NewClass:
    def __init__(self):
        self.input_size = 10
        self.n_neurons = 1024
        self.initialize_nengo()

    def initialize_nengo(self):
        network = nengo.Network()
        with network:

            def input_func(t):
                return np.ones(self.input_size)

            def output_func(t, x):
                self.output = x

            input_layer = nengo.Node(
                output=input_func, size_in=0, size_out=self.input_size
            )

            ensemble = nengo.Ensemble(
                n_neurons=self.n_neurons,
                dimensions=1,
            )

            output_layer = nengo.Node(
                output=output_func, size_in=self.n_neurons, size_out=0
            )

            conn_in = nengo.Connection(
                input_layer,
                ensemble.neurons,
                transform=np.ones((self.n_neurons, self.input_size)),
            )
            conn_out = nengo.Connection(ensemble.neurons, output_layer, synapse=None)

        self.network = network

    def run(self, case, num_resets):
        for i in range(num_resets):
            with nengo_loihi.Simulator(self.network, precompute=False) as sim:
                if case == 1:
                    for _ in range(1000):
                        sim.run_steps(10)
                elif case == 2:
                    for _ in range(10):
                        sim.run_steps(1000)
                elif case == 3:
                    for _ in range(2):
                        sim.run_steps(5000)
                elif case == 4:
                    for _ in range(1):
                        sim.run_steps(10000)

            del sim

            print('finished iter', i+1)
            gc.collect()

nengo_class = NewClass()
snap0 = snapshot()
# nengo_class.run(case=1, num_resets=1)
nengo_class.run(case=4, num_resets=1)
snap1 = snapshot()
snapd = snapshot_diff(snap0, snap1)
print_snapshot(snapd)
tbekolay commented 2 years ago

This leads me to question how much top can be trusted for these measurements; i.e. when a process frees up memory, I'm not sure if this is reflected in the memory usage reported by top.

Processes in general can free up memory, but Python essentially never releases memory back to the operating system. Python manages its own memory heap, so if you continue to allocate memory, it will keep asking the OS to give it more. If you stop needing memory and garbage collect it in Python, that memory becomes free for the Python process to use in the future, but it does not get released back to the OS, so the memory used by a Python process is generally a bit above the maximum amount of memory that process has ever needed.
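A rough, Linux-only illustration of this point (it reads VmRSS from /proc instead of relying on top; the exact numbers will vary with the allocator and the allocation pattern):

import gc

def rss_kb():
    # resident set size of this process in kB, as the OS (and top) sees it
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

print("start      :", rss_kb())
data = [list(range(10)) for _ in range(2_000_000)]  # many small Python objects
print("allocated  :", rss_kb())
keep = data[::1000]  # keep a few objects so allocator arenas stay partly in use
del data
gc.collect()
# RSS often stays near its peak here, even though the freed memory is
# available for reuse inside the Python process
print("after free :", rss_kb())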

hunse commented 2 years ago

Thanks @tbekolay, that's good to know.

There is still the question of why @nmnaughton sees higher memory usage in top in case 1 versus case 4. In terms of things like the spike queues filling up, these two cases should be the same. Maybe there's something else that causes the case with multiple run_steps calls to use higher peak memory?

Anyway, hopefully just reducing the amount of memory used during run (through the fixes in #324) is good enough for now.

nmnaughton commented 2 years ago

So I think overall I actually saw the same memory usage between the different cases; the difference was that in case 1 the memory was accumulated and held by the Python process, while in case 4 there was a quick spike in memory when the spike queue was dotted with the weights, but then that memory appeared to be released back to the system according to top (not sure how that squares with @tbekolay's point, though).

This was an issue because I was running multiple simulations on a cluster, so if each simulation was always holding on to this 'peak' memory, the combined memory usage would cause SLURM to throw an out-of-memory error. If instead each simulation only had a short spike in memory usage at the end of each run, that was not as much of an issue, since the simulations did not all end at the same time.

Either way, it looks like #324 addresses the issue. Thanks for your help on this!