google-deepmind / lab

A customisable 3D platform for agent-based AI research

File resources leak: too many open files #135

Open wookayin opened 5 years ago

wookayin commented 5 years ago

Consider the following code --- we generate 1000 episodes, each consisting of 100 steps. For each episode, a deepmind_lab.Lab instance is created and closed.

It should run without errors, but the number of open file descriptors keeps increasing until the process dies with an error: too many open files (or with a segmentation fault if the native code fails to open a file).

import numpy as np
import deepmind_lab
from tqdm import tqdm
import tempfile

for i in tqdm(range(1000)):
    env = deepmind_lab.Lab(
        'nav_maze_static_01',
        observations=[
            'RGB_INTERLEAVED', 'DEBUG.CAMERA_INTERLEAVED.TOP_DOWN',
            'DEBUG.POS.TRANS', 'DEBUG.POS.ROT', 'VEL.TRANS', 'VEL.ROT',
        ],
        renderer='hardware',
    )
    env.reset()
    action = np.zeros([7], dtype=np.intc)
    for t in range(100):
        env.step(action)
    env.close()

    with tempfile.TemporaryFile() as fp:
        pass     # too many open files?

On my Linux system, the process was killed at around 250 episodes; at that point the number of open files (lsof | grep 8922 | wc -l) was approximately 4000. FYI, ulimit -n gives 1024.

$ python m.py
 25%|████████                        | 251/1000 [05:36<16:33,  1.33s/it]
[1]    8922 segmentation fault (core dumped)  python m.py

What is happening?

The lsof output is filled with lines like:

...
python     8922  8929            (ID)  DEL       REG                0,5            384322126 /dev/zero
python     8922  8929            (ID)  DEL       REG                0,5            384322129 /dev/zero
python     8922  8929            (ID)  DEL       REG                0,5            384322148 /dev/zero
python     8922  8929            (ID)  DEL       REG                0,5            384322149 /dev/zero
...

as well as a growing number of lines like:

...
python     8922  8929            (ID)  811r      REG                8,3  55390150    6423400 /home/(ID)/.cache/bazel/_bazel_(ID)/91edf13a2436a0b47d7f3009b4283215/execroot/org_deepmind_lab/bazel-out/k8-opt/genfiles/baselab/assets.pk3
python     8922  8929            (ID)  812r      REG                8,3    195246   60031226 /home/(ID)/.cache/bazel/_bazel_(ID)/91edf13a2436a0b47d7f3009b4283215/execroot/org_deepmind_lab/bazel-out/k8-opt/genfiles/baselab/assets_bots.pk3
python     8922  8929            (ID)  813r      REG                8,3      9662   60031225 /home/(ID)/.cache/bazel/_bazel_(ID)/91edf13a2436a0b47d7f3009b4283215/execroot/org_deepmind_lab/bazel-out/k8-opt/genfiles/baselab/assets_oa.pk3
python     8922  8929            (ID)  814r      REG                8,3    371473    6815822 /home/(ID)/.cache/bazel/_bazel_(ID)/91edf13a2436a0b47d7f3009b4283215/execroot/org_deepmind_lab/bazel-out/k8-opt/genfiles/baselab/vm.pk3
...

Comparing two snapshots of the lsof output makes it easy to see that these lines keep multiplying.

To sum up, file descriptors for /dev/zero and for the resource files (*.pk3) are created but never released. This looks like an obvious resource leak. I haven't yet looked into where they are opened.
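
As an aside, the growth is easy to watch from inside the script itself. Here is a minimal, Linux-only sketch (the helper name is made up) that counts the entries in /proc/self/fd:

import os

def count_open_fds():
    # Linux-specific: every entry in /proc/self/fd is one open descriptor
    # of the current process.
    return len(os.listdir('/proc/self/fd'))

# For example, print(i, count_open_fds()) right after env.close() in the loop
# above; with the leak present, the count grows by a roughly constant amount
# per episode.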

P.S. For your information, the reasons I create a new Lab instance for every episode instead of reusing one instance throughout the process include #133 and #134.

tkoeppe commented 5 years ago

Every new Lab instance results in a new copy of the DSO on disk (see dmlab_so_loader.cc). The copy is immediately unlinked, so it has no name, but the file remains as long as it's open, which is until you destroy the environment.
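
Roughly, the pattern is the following (an illustrative Python sketch, not the actual C++ code in dmlab_so_loader.cc; the function name is made up):

import ctypes
import os
import shutil
import tempfile

def load_private_copy(so_path):
    fd, tmp_path = tempfile.mkstemp(suffix='.so')
    os.close(fd)
    shutil.copyfile(so_path, tmp_path)  # fresh copy of the DSO on disk
    handle = ctypes.CDLL(tmp_path)      # loading it keeps the inode alive
    os.unlink(tmp_path)                 # no name left, but the file persists
    return handle                       # while the loaded copy stays open

Each environment that has not been destroyed therefore keeps one such nameless open file alive.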

Could that explain what you're seeing?

wookayin commented 5 years ago

In the example code above we are destroying the environment with this line: env.close(). I think all the internal resources (i.e. files) should be released when we close the environment. However, I can see that they are nevertheless not freed, hence the leak.

tkoeppe commented 5 years ago

Hm, that's right. It's possible that there's some leaky code in ioq3 itself, where files are opened that are never closed. I haven't noticed anything like that yet, but I've also not looked very carefully. Maybe you can run this through strace or a debugger and find out what's being opened?

Or maybe you could increase your resource limits (ulimit)?

wookayin commented 5 years ago

Have you taken a look at the lsof log above? The entries that increase linearly (i.e. with every iteration) are for the following files:

python3.6/site-packages/deepmind_lab/baselab/assets.pk3
python3.6/site-packages/deepmind_lab/baselab/assets_bots.pk3
python3.6/site-packages/deepmind_lab/baselab/assets_oa.pk3
python3.6/site-packages/deepmind_lab/baselab/vm.pk3

From the strace log I can see that those files are opened but not closed. Sometimes they are opened and then closed, so a close must be missing somewhere. But it's quite hard to trace back where they are being opened and closed... do you have any suggestions on how to nail down the execution trace?
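
For what it's worth, here is a minimal sketch of how one might pair the opens and closes in such a log. It assumes a trace of a single process, e.g. produced with strace -e trace=open,openat,close -o trace.log python m.py, and is only meant as a starting point:

import re
import sys
from collections import Counter

# Map each successfully opened descriptor back to the path it was opened with.
open_re = re.compile(r'open(?:at)?\(.*"([^"]+)".*=\s*(\d+)')
close_re = re.compile(r'close\((\d+)\)\s*=\s*0')

fd_to_path = {}
with open(sys.argv[1]) as log:
    for line in log:
        m = open_re.search(line)
        if m:
            fd_to_path[int(m.group(2))] = m.group(1)
            continue
        m = close_re.search(line)
        if m:
            fd_to_path.pop(int(m.group(1)), None)

# Whatever is left was opened but never closed during the trace.
never_closed = Counter(fd_to_path.values())
for path, count in never_closed.most_common():
    print(count, path)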

tkoeppe commented 5 years ago

I think the code for loading those PK3s is all part of ioq3, so it's quite possible that that's just leaking. (It was never designed to be hosted like we use it.) You could try running in a debugger, breaking at the syscall and then finding out where that's happening. Once we have that, we could look in the code for obvious omissions of cleanup.

wookayin commented 5 years ago

Okay, I will try using a debugger to identify the leak and will let you know when I find a clue. Thanks!

sjtuytc commented 5 years ago

For anyone who still has this trouble: you can just wrap the loop body in a subprocess, and the error will disappear.
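
For example, here is a rough sketch of that workaround (each episode runs in a fresh child process via multiprocessing, so whatever descriptors leak inside it are closed by the OS when the child exits; the observation list is trimmed for brevity):

import multiprocessing as mp

import numpy as np
import deepmind_lab

def run_episode(episode_index, num_steps=100):
    # Anything deepmind_lab leaks stays inside this child process.
    env = deepmind_lab.Lab(
        'nav_maze_static_01',
        observations=['RGB_INTERLEAVED'],
        renderer='hardware',
    )
    env.reset()
    action = np.zeros([7], dtype=np.intc)
    for _ in range(num_steps):
        env.step(action)
    env.close()

if __name__ == '__main__':
    for i in range(1000):
        p = mp.Process(target=run_episode, args=(i,))
        p.start()
        p.join()
        # All of the child's file descriptors are closed when it exits,
        # so the parent's descriptor count stays flat across episodes.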