JuliaA369 opened this issue 2 years ago
Thank you for reporting this, we need to look into it.
We are currently also working on replacing the multiprocessing library with a more sophisticated solution, which might also help with this problem.
I took a look, and it's difficult for me to reproduce this behavior. However, the simulation gets slower and slower, and after only a couple hundred iterations it's unbearably slow. So I can at least agree that something is going wrong, which could be a memory issue.
From the description of the problem, it looks like the threads are not getting properly closed. I added print("active threads", threading.active_count())
in the loop of the test script above and saw that the number of active threads is indeed increasing.
active threads 1
k = 0
active threads 3
k = 1
active threads 5
k = 2
active threads 7
k = 3
active threads 9
k = 4
active threads 11
k = 5
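For context, the diagnostic boils down to something like the sketch below. It assumes the original test script re-creates the whole LIF-Dense-LIF network on every iteration, using the same processes and parameters as the script further down; the actual script from the first post is not reproduced here.

import threading

import numpy as np
from lava.magma.core.run_conditions import RunSteps
from lava.magma.core.run_configs import Loihi1SimCfg
from lava.proc.dense.process import Dense
from lava.proc.lif.process import LIF

if __name__ == "__main__":
    for k in range(10):
        # Leftover runtime threads from earlier iterations show up here.
        print("active threads", threading.active_count())

        # Re-create the whole network each iteration, as in the original report.
        lif1 = LIF(shape=(3,), vth=4900, dv=100, du=10, bias_mant=(1, 3, 2))
        dense = Dense(weights=np.random.randint(1, 10, (2, 3)))
        lif2 = LIF(shape=(2,), vth=4900, dv=100, du=10, bias_mant=0)
        lif1.s_out.connect(dense.s_in)
        dense.a_out.connect(lif2.a_in)

        lif2.run(condition=RunSteps(num_steps=10),
                 run_cfg=Loihi1SimCfg(select_tag="fixed_pt"))
        lif2.stop()
        print("k =", k)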
I found that adding self.send(np.zeros(self._shape)) at line 134, under self._done = True, in pypychannel.py, and self.recv() at line 282, helps in the sense that the number of active threads no longer increases. Unfortunately, execution speed still decreases over iterations.
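The general pattern behind that workaround can be shown with a small standalone example, independent of lava's pypychannel.py (this is only an illustration of the unblock-on-shutdown idea, not the actual channel code): a worker thread blocked on a receive never gets to check the done flag, so join() pushes one dummy message through to wake it up.

import queue
import threading

import numpy as np


class ToyPort:
    # Toy stand-in for a channel port whose worker thread blocks on a receive.
    def __init__(self, shape):
        self._shape = shape
        self._queue = queue.Queue()
        self._done = False
        self._thread = threading.Thread(target=self._worker)
        self._thread.start()

    def _worker(self):
        while not self._done:
            self._queue.get()  # blocks until a message arrives

    def send(self, data):
        self._queue.put(data)

    def join(self):
        self._done = True
        # Without this dummy send the worker stays blocked in get() forever,
        # and the active thread count grows with every iteration.
        self.send(np.zeros(self._shape))
        self._thread.join()


port = ToyPort(shape=(3,))
port.join()
print("active threads", threading.active_count())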
This is a known issue with Python's shared memory implementation leaking file descriptors, with the OS eventually throwing this error. We are working on a C++-based shared memory implementation and an overall redesign of the message passing architecture, keeping the Channel APIs intact (no user code changes). That should fix this issue. I currently don't have a date for when it will be merged.
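For anyone curious about the failure mode: shared-memory segments that are created but never closed or unlinked each keep a file descriptor open, and the OS eventually refuses to allocate more. A minimal standalone illustration using Python's multiprocessing.shared_memory (not lava code; the iteration count and the error text in the comment are illustrative):

from multiprocessing import shared_memory

segments = []
try:
    # Each segment keeps a file descriptor open; without close()/unlink()
    # they accumulate until the OS limit is hit, typically surfacing as
    # "OSError: [Errno 24] Too many open files".
    for _ in range(100_000):
        segments.append(shared_memory.SharedMemory(create=True, size=1024))
except OSError as err:
    print("failed after", len(segments), "segments:", err)
finally:
    for shm in segments:
        shm.close()
        shm.unlink()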
Based on yesterday's discussion, it sounds like the merge will be done in about 3 weeks, and we hope to make it part of the next release.
May I add that there has been some development with respect to setting weights at runtime: you can now initialize your network once and set the weights in each iteration instead of re-creating the complete network every time. This speeds up execution drastically and also avoids running into the Python memory issue.
The script above would look like this:
import numpy as np

from lava.magma.core.run_conditions import RunSteps
from lava.proc.lif.process import LIF
from lava.proc.dense.process import Dense
from lava.magma.core.run_configs import Loihi1SimCfg

num_steps = 5000
du = 10
dv = 100
vth = 4900

if __name__ == "__main__":
    # Create processes
    lif1 = LIF(shape=(3, ),
               vth=vth,
               dv=dv,
               du=du,
               bias_mant=(1, 3, 2),
               name="lif1")
    dense = Dense(weights=np.random.randint(1, 10, (2, 3)), name='dense')
    lif2 = LIF(shape=(2, ),
               vth=vth,
               dv=dv,
               du=du,
               bias_mant=0,
               name='lif2')

    # Connect processes: lif1 -> dense -> lif2
    lif1.s_out.connect(dense.s_in)
    dense.a_out.connect(lif2.a_in)

    for k in range(num_steps):
        if k > 0:
            # Update the weights in place instead of re-creating the network
            dense.weights.set(np.random.randint(1, 10, (2, 3)))
        lif2.run(condition=RunSteps(num_steps=10),
                 run_cfg=Loihi1SimCfg(select_tag="fixed_pt"))
        print("k = " + str(k), dense.weights.get())
    lif2.stop()
Oh nice! Great news! Is this part of main?
yes.
The only other change you need to make in this script, to be truly equivalent to the one you posted initially, is to reset the states (u and v) of the LIF neurons in each iteration.
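The loop from the script above would then look roughly like this (a sketch reusing the same Var set() API that is used for the weights; the shapes and the use of plain zeros are assumptions on my part):

    for k in range(num_steps):
        if k > 0:
            dense.weights.set(np.random.randint(1, 10, (2, 3)))
            # Reset the LIF state Vars so each iteration starts from scratch
            # (shapes match the LIF processes defined above).
            lif1.u.set(np.zeros((3,)))
            lif1.v.set(np.zeros((3,)))
            lif2.u.set(np.zeros((2,)))
            lif2.v.set(np.zeros((2,)))
        lif2.run(condition=RunSteps(num_steps=10),
                 run_cfg=Loihi1SimCfg(select_tag="fixed_pt"))
        print("k = " + str(k), dense.weights.get())
    lif2.stop()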
Wow, that is a lot faster! For LearningDense, is there (or will there be) a similar way to update dw?
As a future "nice to have" suggestion, it would be helpful if the built-in Lava processes had a reset() function that automatically resets all internal states to their default values.
Objective of issue: Allow the network to be re-instantiated and run an infinite number of times without memory issues, so that Lava can be used for offline training.
Lava version:
I'm submitting a ...
Current behavior:
Expected behavior:
Steps to reproduce:
Related code:
Other information:
self.join() doesn't appear to be working properly, and the runtime services don't get killed (when I run in debug mode). However, if I put self.join() immediately below if self._is_started:, then the runtime service threads get terminated appropriately. I'm not sure why this is.
The output error I'm seeing (before the fix from 2) with 0.4.0 is:
The output error I'm seeing (before the fix from 2) with 0.3.0 is:
Then here is the error with 0.4.0 after the fix on runtime.py: