ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] [Bug] IMPALA causes an OOM after a long run. #23769

Open · ZaberKo opened this issue 2 years ago

ZaberKo commented 2 years ago

What happened + What you expected to happen

When I train IMPALA, the program runs out of memory after a very large number of timesteps (after 10,000 iterations). The first error looks like this:

(ImpalaTrainer pid=850615) Exception in thread Thread-7:
(ImpalaTrainer pid=850615) Traceback (most recent call last):
(ImpalaTrainer pid=850615)   File "/home/zaber/mambaforge/envs/ray/lib/python3.9/threading.py", line 973, in _bootstrap_inner
(ImpalaTrainer pid=850615)     self.run()
(ImpalaTrainer pid=850615)   File "/home/zaber/mambaforge/envs/ray/lib/python3.9/site-packages/ray/rllib/execution/learner_thread.py", line 69, in run
(ImpalaTrainer pid=850615)     self.step()
(ImpalaTrainer pid=850615)   File "/home/zaber/mambaforge/envs/ray/lib/python3.9/site-packages/ray/rllib/execution/multi_gpu_learner_thread.py", line 143, in step
(ImpalaTrainer pid=850615)     buffer_idx, released = self.ready_tower_stacks_buffer.get()
(ImpalaTrainer pid=850615)   File "/home/zaber/mambaforge/envs/ray/lib/python3.9/site-packages/ray/rllib/execution/buffers/minibatch_buffer.py", line 46, in get
(ImpalaTrainer pid=850615)     self.buffers[self.idx] = self.inqueue.get(timeout=self.timeout)
(ImpalaTrainer pid=850615)   File "/home/zaber/mambaforge/envs/ray/lib/python3.9/queue.py", line 179, in get
(ImpalaTrainer pid=850615)     raise Empty
(ImpalaTrainer pid=850615) _queue.Empty

After that, memory usage keeps increasing until the process is killed by the OOM killer (confirmed via the netdata chart and dmesg). I also tested on two different machines, and both hit the OOM. I suspect the issue is somewhere around MultiGPULearnerThread.
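For context, the stack trace above comes from the learner thread blocking on its input queue with a timeout. Below is a minimal, self-contained sketch (my own illustration, not RLlib code) of how an uncaught queue.Empty kills such a thread with the same "Exception in thread ..." message:

import queue
import threading

# Illustration only: a worker thread blocks on an input queue with a timeout;
# if nothing arrives in time, queue.Empty propagates out of run() and the
# thread dies, printing "Exception in thread ..." as in the log above.
inqueue = queue.Queue()

def learner_step():
    # mirrors MinibatchBuffer.get(): self.inqueue.get(timeout=self.timeout)
    return inqueue.get(timeout=1.0)  # raises queue.Empty when starved

t = threading.Thread(target=learner_step)
t.start()
t.join()

Once such a thread is gone, incoming sample batches are no longer consumed, which would be consistent with the memory growth observed afterwards (see the follow-up comments below).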

Versions / Dependencies

Python: 3.9.10
Ray: tested on 1.10.0 & 1.11.0
OS: Ubuntu 20.04
3rd-party libraries: None

Reproduction script

The issue happens both with tune.run() and with manual training.

# %%
import ray
from ray import tune
from ray.rllib.agents.impala import ImpalaTrainer, DEFAULT_CONFIG

ray.init(
    # include_dashboard=True,
    num_cpus=32,
    num_gpus=1,
)

# %%
config = DEFAULT_CONFIG.copy()
config.update({
    "framework": "torch",
    "num_gpus": 1,
    "num_workers": 16,
    "num_envs_per_worker": 5,
    "clip_rewards": True,
    # Evaluation settings (these later turned out to be related to the OOM,
    # see the follow-up comments below).
    "evaluation_num_workers": 4,
    "evaluation_interval": 100,
    "evaluation_duration": 20,
    "evaluation_config": {
        # "num_gpus_per_worker": 0.01,
        "explore": False,
    },
    "env": "BreakoutNoFrameskip-v4",
    "rollout_fragment_length": 50,
    "train_batch_size": 500,
    "lr_schedule": [
        [0, 0.0005],
        [20000000, 0.000000000001],
    ],
    "min_time_s_per_reporting": 0,
    "timesteps_per_iteration": 0,
    # "log_level": "DEBUG",
})

# %%
stop = {
    "training_iteration": 100000,
}

tune.run(ImpalaTrainer, config=config, stop=stop)

# %%
# or train manually:
trainer = ImpalaTrainer(config=config)
for i in range(100000):
    res = trainer.train()

gjoliver commented 2 years ago

thanks for the report. we are aware of a potential memory leak in the framework and will post an update once we understand the problem better. one band-aid: we are introducing recovery functionality after a worker fails, which should bring the crashed worker back automatically.
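For readers looking for that band-aid, here is a rough sketch of the relevant fault-tolerance settings, reusing the config dict from the repro script above. Treat the exact keys as assumptions: ignore_worker_failures has existed for a while, but recreate_failed_workers only appears in Ray releases newer than the 1.10/1.11 used here.

# Sketch only; availability of these keys depends on the Ray/RLlib version.
config.update({
    "ignore_worker_failures": True,     # keep training when a rollout worker crashes
    # "recreate_failed_workers": True,  # newer Ray releases: restart crashed workers automatically
})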

ZaberKo commented 2 years ago

Update: after testing different config parameters, I found that when I stop using evaluation (by setting evaluation_interval to a large number), the OOM no longer happens (tested over 100,000 iterations). So: is there something wrong with the "sync_weight" function between "workers" and "evaluation_workers"? I am aware that the code there is a bit tricky, overwriting the non-writable numpy arrays with pytorch tensors during syncing.

Framework: pytorch 1.11.0
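In config terms, the workaround described above is just the following (a sketch, again reusing the config dict from the repro script; it sidesteps the OOM by effectively turning evaluation off rather than fixing the underlying problem):

# Workaround sketch: disable (or effectively disable) periodic evaluation so the
# evaluation workers never have to hold an extremely long episode in memory.
config.update({
    "evaluation_interval": None,  # or a very large number of iterations, as described above
})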

ZaberKo commented 2 years ago

Looking at the memory changes in netdata and the logs in the RLlib output, I found that the _queue.Empty error happens first and then memory starts blowing up. @gjoliver So my guess is that some code raises an error, which leads to the death of a worker (which could cause the _queue.Empty error), and that in turn results in the OOM.

As for auto-recovery, it seems to work at the Trainer level and to rely on the ray.tune.run() API. In my case I want to implement my own Trainer class and call it manually (with some special design), and it is hard to save all the state needed for recovery.
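For the manual-training case, here is a minimal sketch of periodic checkpointing with the standard Trainer save()/restore() API, building on the repro script above. Note that this only covers the Trainer's own state, not the extra custom state mentioned above.

# Sketch of manual checkpointing around a custom training loop; any state that
# lives outside the Trainer still has to be saved and restored separately.
trainer = ImpalaTrainer(config=config)
checkpoint_path = None
for i in range(100000):
    res = trainer.train()
    if i % 100 == 0:
        checkpoint_path = trainer.save()  # returns the path of the new checkpoint

# after a crash, rebuild the trainer and restore from the last checkpoint:
# trainer = ImpalaTrainer(config=config)
# trainer.restore(checkpoint_path)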

gjoliver commented 2 years ago

that's interesting. does the old repro script still work? any chance you can help update the repro script? we'd really love to take a look. thanks for all the debugging so far.

ZaberKo commented 2 years ago

> that's interesting. does the old repro script still work? any chance you can help update the repro script? we'd really love to take a look. thanks for all the debugging so far.

@gjoliver Yes, the script still works.

ZaberKo commented 2 years ago

After further investigation, I finally found the source of the bug. It seems that the trained policy can keep a game running for a very long time, i.e., an episode can contain a huge number of timesteps (e.g., 98,490 timesteps in Breakout). For example, in Breakout the agent sometimes gets stuck like this: 😂

https://user-images.githubusercontent.com/20830726/164035802-733ccb8b-4635-4b94-b447-07c76b24004f.mp4

This eventually exhausts the memory of the sampler in the evaluation workers, which temporarily stores the current episode. It also explains why the OOM does not happen when I disable evaluation.
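As a back-of-the-envelope check (my own numbers and assumptions, not from RLlib), a single episode of that length is already enormous if the framestacked 84x84x4 Atari observations end up stored as float32:

# Rough estimate under the stated assumptions (84x84x4 observation, float32 storage):
obs_bytes = 84 * 84 * 4 * 4            # ~113 KB per stored observation
episode_bytes = obs_bytes * 98_490     # one 98,490-step episode
print(episode_bytes / 2**30)           # ~10.35 GiB, roughly matching the figure quoted below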

Some ideas about the solution:

  1. Override the default horizon of the Gym Atari environments (400,000) by setting "horizon": 5000 in the config (see the sketch after this list).
  2. Change the implementation of the sampler in the evaluation workers. For example, add an option to collect only the rewards instead of everything. In addition, the sampler's data efficiency should be improved; currently its memory cost is several times the real observation data size. (E.g., the real size of the observations over 98,490 timesteps is around 10.3 GB.)
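A sketch of the first idea, reusing the config dict from the repro script; the "horizon" key and the 5000 value are taken straight from this thread, and the right cap depends on the available memory (see the follow-up below).

# Cap the episode length so a single evaluation episode cannot grow without bound.
config.update({
    "horizon": 5000,  # value used in this thread; tune to your memory budget
})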

gjoliver commented 2 years ago

why not just limit the horizon of the episodes to a reasonable length?

ZaberKo commented 2 years ago

> why not just limit the horizon of the episodes to a reasonable length?

Yeah, I think that is enough for now, although it is still not suitable for machines with less memory (e.g., 32 GB is not enough for horizon=3000). With the original horizon (400,000), even a machine with 512 GB of memory still runs into an OOM at the evaluation stage. But the real data size is actually not that big (as I mentioned above).

Besides, I think a tip about setting a reasonable horizon should be added to the documentation.

gjoliver commented 2 years ago

I see. completely agree! thanks for all the detective work :)

ZaberKo commented 1 year ago

Are there any updates on solving this issue?