lava-nc / lava

A Software Framework for Neuromorphic Computing
https://lava-nc.org

Memory error after reinitializing and stopping a 2-LIF process network multiple times #336

Open JuliaA369 opened 2 years ago

JuliaA369 commented 2 years ago

Objective of issue: Allow the network to be re-instantiated and run an arbitrary number of times without memory issues, so that Lava can be used for offline training.

Lava version:

I'm submitting a ...

Current behavior:

Expected behavior:

Steps to reproduce:

Related code:

import numpy as np
from lava.magma.core.run_conditions import RunSteps
from lava.proc.lif.process import LIF
from lava.proc.dense.process import Dense
from lava.magma.core.run_configs import Loihi1SimCfg

num_steps = 5000
du = 10
dv = 100
vth = 4900
if __name__ == "__main__":

    for k in range(num_steps):
        # Re-create the two-LIF network from scratch on every iteration
        lif1 = LIF(shape=(3,),
                   vth=vth,
                   dv=dv,
                   du=du,
                   bias_mant=(1, 3, 2),
                   name="lif1")

        dense = Dense(weights=np.random.rand(2, 3), name="dense")

        lif2 = LIF(shape=(2,),
                   vth=vth,
                   dv=dv,
                   du=du,
                   bias_mant=0,
                   name="lif2")

        lif1.s_out.connect(dense.s_in)
        dense.a_out.connect(lif2.a_in)

        # Run briefly, then stop; everything should be released here,
        # but resources keep accumulating across iterations
        lif2.run(condition=RunSteps(num_steps=10),
                 run_cfg=Loihi1SimCfg(select_tag="fixed_pt"))

        lif2.stop()
        print("k = " + str(k))

Other information:

  1. I see an error at around the same point for both 0.3.0 and 0.4.0, although the error output is a bit different.
  2. I noticed that inside stop() of runtime.py, self.join() doesn't appear to be working properly, and the runtime services don't get killed (when I run in debug mode). However, if I put self.join() immediately below if self._is_started:, then the runtime service threads get terminated appropriately (see the sketch below this list). I'm not sure why that is.
  3. The fix from 2 allows a few hundred more iterations before a memory issue appears, but I still see the memory error eventually.
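
For reference, the reordering described in item 2 looks roughly like this (an illustrative sketch only, not the actual lava code; the rest of Runtime.stop() is elided):

# Sketch of the change from item 2 in lava/magma/runtime/runtime.py.
# Illustrative only: the real stop() contains additional teardown logic.
class Runtime:
    def stop(self):
        if self._is_started:
            self.join()  # joining here lets the runtime service threads
                         # terminate before the rest of the teardown runs
            ...          # remainder of the existing stop() logic (elided)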

The output error I'm seeing (before the fix from 2) with 0.4.0 is:

k = 1883
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jashkanazy/dev/lava_tests/simple_thread_test.py", line 34, in <module>
    lif2.run(condition=RunSteps(num_steps=10),
  File "/home/jashkanazy/dev/lava_tests/lava/src/lava/magma/core/process/process.py", line 343, in run
    self._runtime.initialize()
  File "/home/jashkanazy/dev/lava_tests/lava/src/lava/magma/runtime/runtime.py", line 144, in initialize
    self._start_ports()
  File "/home/jashkanazy/dev/lava_tests/lava/src/lava/magma/runtime/runtime.py", line 154, in _start_ports
    port.start()
  File "/home/jashkanazy/dev/lava_tests/lava/src/lava/magma/compiler/channels/pypychannel.py", line 240, in start
    self.thread.start()
  File "/usr/lib/python3.8/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

The output error I'm seeing (before the fix from 2) with 0.3.0 is:

k = 1881
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/resource_tracker.py", line 201, in main
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    cache[rtype].remove(name)
KeyError: '/psm_49e392f1'
    exec(code, run_globals)
  File "/home/jashkanazy-local/dev/sllml/SLLML/src/snn-algos/examples/compare_loihi_to_lava/simple_thread/simple_thread.py", line 78, in <module>
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/core/process/process.py", line 422, in run
    self._runtime.initialize()
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/runtime/runtime.py", line 138, in initialize
    self._build_sync_channels()
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/runtime/runtime.py", line 199, in _build_sync_channels
    channel: Channel = sync_channel_builder.build(
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/compiler/builders/builder.py", line 747, in build
    return channel_class(
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/compiler/channels/pypychannel.py", line 337, in __init__
    shm = smm.SharedMemory(int(nbytes * size))
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 1386, in SharedMemory
    sms = shared_memory.SharedMemory(None, create=True, size=size)
  File "/usr/lib/python3.8/multiprocessing/shared_memory.py", line 113, in __init__
    self._mmap = mmap.mmap(self._fd, size)
OSError: [Errno 12] Cannot allocate memory

And here is the error with 0.4.0 after applying the fix from 2 to runtime.py:

k = 2134
Process SystemProcess-10677:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 308, in _bootstrap
    util._run_after_forkers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 163, in _run_after_forkers
    items = list(_afterfork_registry.items())
MemoryError
Process SystemProcess-10678:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 308, in _bootstrap
    util._run_after_forkers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 163, in _run_after_forkers
    items = list(_afterfork_registry.items())
MemoryError
Process SystemProcess-10680:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 308, in _bootstrap
    util._run_after_forkers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 163, in _run_after_forkers
    items = list(_afterfork_registry.items())
MemoryError
Process SystemProcess-10679:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 308, in _bootstrap
    util._run_after_forkers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 163, in _run_after_forkers
    items = list(_afterfork_registry.items())
MemoryError
PhilippPlank commented 2 years ago

Thank you for reporting this, we need to look into it.

We are currently also working on replacing the multiprocessing library with a more sophisticated solution, which might also help with this problem.

weidel-p commented 2 years ago

I took a look, but it's difficult for me to reproduce this behavior. The simulation gets slower and slower, and after only a couple hundred iterations it's unbearably slow. So I can at least confirm that something is going wrong, which could be a memory issue.

From the description of the problem, it looks like the threads are not being closed properly. I added print("active threads", threading.active_count()) inside the loop of the test script above and saw that the number of active threads does indeed keep increasing:

active threads 1
k = 0
active threads 3
k = 1
active threads 5
k = 2
active threads 7
k = 3
active threads 9
k = 4
active threads 11
k = 5

I found that adding self.send(np.zeros(self._shape)) at line 134 in pypychannel.py (just below self._done = True), and self.recv() at line 282, helps in the sense that the number of active threads no longer increases. Unfortunately, the execution speed still decreases over iterations.
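
To illustrate the pattern behind that workaround (a generic sketch, not lava's actual pypychannel code): a thread blocked on a queue never re-checks the done flag, so join() hangs unless a dummy item wakes it up.

import queue
import threading

import numpy as np


class BlockedChannelThread:
    """Generic illustration of the join problem described above."""

    def __init__(self, shape):
        self._shape = shape
        self._queue = queue.Queue()
        self._done = False
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        # Blocks in get() until an item arrives; without a wake-up item the
        # thread never re-checks self._done and therefore never exits.
        while not self._done:
            self._queue.get()

    def send(self, data):
        self._queue.put(data)

    def join(self):
        self._done = True
        # The dummy send wakes the blocked thread so it can observe _done
        # and exit; this is the role of the added self.send(np.zeros(...)).
        self.send(np.zeros(self._shape))
        self.thread.join()


# Usage: join() now returns promptly instead of hanging.
# channel = BlockedChannelThread((3,))
# channel.join()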

joyeshmishra commented 2 years ago

This is a known issue with Python's shared memory implementation leaking file descriptors until the OS eventually throws this error. We are working on a C++ based shared memory implementation and an overall redesign of the message passing architecture that keeps the Channel APIs intact (no user code changes). That should fix this issue. I currently don't have a date for when it will be merged.
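
One way to observe this locally (a Linux-only diagnostic sketch, not part of lava; open_fd_count and leftover_psm_segments are hypothetical helper names) is to count open file descriptors and leftover /dev/shm segments inside the reproduction loop:

import glob
import os


def open_fd_count() -> int:
    """Number of file descriptors currently open in this process."""
    return len(os.listdir(f"/proc/{os.getpid()}/fd"))


def leftover_psm_segments() -> int:
    """Shared-memory segments (psm_*) left behind in /dev/shm."""
    return len(glob.glob("/dev/shm/psm_*"))


# Example, added inside the loop of the reproduction script:
# print(f"k = {k}, fds = {open_fd_count()}, "
#       f"psm segments = {leftover_psm_segments()}")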

joyeshmishra commented 2 years ago

Based on yesterday's discussion, it sounds like the merge will be done in about 3 weeks, and we hope to make it part of the next release.

weidel-p commented 2 years ago

May I add that there has been some development with respect to setting weights during runtime. You can now initialize your network once and set the weights in each iteration, instead of re-creating the complete network every time. This speeds up execution drastically and also avoids the Python memory issue.

The script above would then look like this:

import numpy as np
from lava.magma.core.run_conditions import RunSteps
from lava.proc.lif.process import LIF
from lava.proc.dense.process import Dense
from lava.magma.core.run_configs import Loihi1SimCfg

num_steps = 5000
du = 10
dv = 100
vth = 4900
if __name__ == "__main__":
    # Create processes
    lif1 = LIF(shape=(3,),
               vth=vth,
               dv=dv,
               du=du,
               bias_mant=(1, 3, 2),
               name="lif1")

    dense = Dense(weights=np.random.randint(1, 10, (2, 3)), name="dense")

    lif2 = LIF(shape=(2,),
               vth=vth,
               dv=dv,
               du=du,
               bias_mant=0,
               name="lif2")

    lif1.s_out.connect(dense.s_in)
    dense.a_out.connect(lif2.a_in)

    for k in range(num_steps):

        # After the first run, update the weights in place instead of
        # re-creating the network
        if k > 0:
            dense.weights.set(np.random.randint(1, 10, (2, 3)))

        lif2.run(condition=RunSteps(num_steps=10),
                 run_cfg=Loihi1SimCfg(select_tag="fixed_pt"))

        print("k = " + str(k), dense.weights.get())

    lif2.stop()
JuliaA369 commented 2 years ago

Oh nice! Great news! Is this part of main?

weidel-p commented 2 years ago

Yes.

The only other change needed to make this script truly equivalent to the one you posted initially is to reset the states (u and v) of the LIF neurons in each iteration.
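
A minimal sketch of that reset, assuming the LIF Vars u and v support the same set() interface used for dense.weights above (and, like the weight update, can only be called after the first run, i.e. for k > 0):

# Inside the loop, before run(); shapes match the process shapes above.
if k > 0:
    lif1.u.set(np.zeros(3))
    lif1.v.set(np.zeros(3))
    lif2.u.set(np.zeros(2))
    lif2.v.set(np.zeros(2))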

JuliaA369 commented 1 year ago

Wow, that is a lot faster! For LearningDense, is there (or will there be) a similar way to update dw?

As a future "nice to have" suggestion, it would be great if the built-in Lava processes had a reset() function that automatically resets all internal states to their default values.