Open wookayin opened 5 years ago
Every new `Lab` instance results in a new copy of the DSO on disk (see `dmlab_so_loader.cc`). The copy is immediately unlinked, so it has no name, but the file remains as long as it's open, which is until you destroy the environment.
Could that explain what you're seeing?
In the example code above we are destroying the environment with this line: `env.close()`. I think all the internal resources (i.e. files) should be released when we close the environment. However, as the logs show, they are nevertheless not freed, hence a leak.
Hm, that's right. It's possible that there's some leaky code in ioq3 itself, where files are opened that are never closed. I haven't noticed anything like that yet, but I've also not looked very carefully. Maybe you can run this through `strace` or a debugger and find out what's being opened?
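If attaching `strace` is inconvenient, a rough sketch of the same idea from inside Python is to read `/proc/self/fd` (Linux-only) after each episode and see which targets accumulate. The `/dev/zero` open below is just a stand-in for whatever the native code leaks:

```python
import os

def open_fds():
    """Map each open file descriptor of this process to its target path (Linux-only)."""
    fds = {}
    for name in os.listdir("/proc/self/fd"):
        try:
            fds[int(name)] = os.readlink("/proc/self/fd/" + name)
        except OSError:
            pass  # raced with a close; the listing itself opens a transient fd
    return fds

# Example: take a snapshot after each "episode" and watch which paths grow.
f = os.open("/dev/zero", os.O_RDONLY)  # stand-in for whatever the native code opens
snapshot = open_fds()
print(snapshot[f])  # -> '/dev/zero'
os.close(f)
```

Calling `len(open_fds())` once per episode makes linear growth obvious, and the readlink targets tell you which files (e.g. the `.pk3` archives) are responsible.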
Or maybe you could increase your resource limits (`ulimit`)?
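Raising the soft limit only postpones the crash, since the leak itself remains, but as a sketch using Python's `resource` module (the in-process equivalent of `ulimit -n`):

```python
import resource

# Inspect and raise the soft limit on open file descriptors. An
# unprivileged process can raise the soft limit only up to the hard
# limit, which is a ceiling set by the OS or the administrator.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print("new soft limit:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```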
Have you taken a look at the `lsof` log above? The following files increase linearly (i.e. with every iteration):
python3.6/site-packages/deepmind_lab/baselab/assets.pk3
python3.6/site-packages/deepmind_lab/baselab/assets_bots.pk3
python3.6/site-packages/deepmind_lab/baselab/assets_oa.pk3
python3.6/site-packages/deepmind_lab/baselab/vm.pk3
From an `strace` log I can see those files are opened but not closed. Sometimes they are opened and then closed, so a close must be missing somewhere. But it's quite hard to trace back where they are being opened and closed... do you have any suggestion on how to nail down the execution trace?
I think the code for loading those PK3s is all part of ioq3, so it's quite possible that that's just leaking. (It was never designed to be hosted like we use it.) You could try running in a debugger, breaking at the syscall and then finding out where that's happening. Once we have that, we could look in the code for obvious omissions of cleanup.
Okay, I will try to use a debugger to identify the leak and will let you know when I find a clue. Thanks!
For anyone still having this trouble: you can just wrap the loop body in a subprocess, and this error will disappear.
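The subprocess workaround can be sketched with `multiprocessing` on Linux. This is not the real deepmind_lab API: `run_episode` below deliberately leaks a descriptor as a stand-in for the real environment, and the point is that descriptors leaked in a child process are reclaimed by the OS when the child exits, so the parent's fd count stays flat:

```python
import multiprocessing as mp
import os

def fd_count():
    """Number of open file descriptors in this process (Linux-only)."""
    return len(os.listdir("/proc/self/fd"))

def run_episode(episode):
    # Stand-in for the real per-episode work:
    #   env = deepmind_lab.Lab(level, observations, config=...)
    #   env.reset(); step loop; env.close()
    # We deliberately leak a descriptor, as the native code appears to do.
    os.open("/dev/zero", os.O_RDONLY)

if __name__ == "__main__":
    start = fd_count()
    for episode in range(5):
        # Each episode runs in a forked child; any descriptors it leaks
        # are closed by the OS when the child exits.
        p = mp.get_context("fork").Process(target=run_episode, args=(episode,))
        p.start()
        p.join()
        del p  # drop the Process object so its sentinel pipe is closed too
    print("fd growth in parent:", fd_count() - start)
```

The cost is one fork per episode, which is usually negligible next to the episode itself.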
Consider the following code --- we generate 1000 episodes, each of which consists of 100 steps. For each episode a `deepmind_lab.Lab` instance is created and closed. It should run without an error, but the number of open file descriptors keeps increasing until the process dies with an error: too many open files (or a Segmentation Fault if the native code fails to open a file).

On my Linux system, it was killed at around 250 episodes; at that moment the number of open files (`lsof | grep 8922 | wc -l`) was approximately 4000. FYI, `ulimit -n` gives 1024.

What happens?
The content of `lsof` is filled with:

as well as a number of

One can easily see these lines proliferate when comparing two snapshots of `lsof` output.

To sum up, file descriptors for `/dev/zero` and the resource files (`*.pk3`) are created and not properly freed. This would be an obvious resource leak. I haven't looked into where they are opened from.

P.S. For your information, the reason why I was trying to create new `Lab` instances every time instead of reusing one instance throughout the process includes #133 and #134.