ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] SAC RAM Memory Leak #10530

Closed: rstrudel closed this issue 1 year ago

rstrudel commented 4 years ago

What is the problem?

When running SAC on Pendulum-v0 with a GPU and a single or multiple workers, RLlib suffers from a memory leak, independently of the framework used (tf or torch). The attached plot shows the linear growth of memory usage during training (pink and cyan curves). I also ran PPO on Pendulum-v0 with both frameworks and did not observe any memory leak; as the plot shows (red and blue curves), the memory usage is constant.

[Screenshot from 2020-09-03 12-54-28: memory usage during training, SAC (pink/cyan) vs. PPO (red/blue)]

The reported curves are with num_workers=4 and num_envs_per_worker=4; I also observe the leak when setting num_workers=0 and num_envs_per_worker=1.

Ray version and other system information (Python version, TensorFlow version, OS):
ray: 0.9.0.dev0
python: 3.7
pytorch: 1.6.0
tensorflow: 2.3.0
os: ubuntu 18.04

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

import ray
import ray.rllib.agents.sac as sac
from ray.tune.logger import pretty_print

ray.init()

# Start from SAC's defaults and only override a few settings.
config = sac.DEFAULT_CONFIG.copy()
config["timesteps_per_iteration"] = 1000
config["num_gpus"] = 1
config["num_workers"] = 4          # leak also observed with num_workers=0
config["num_envs_per_worker"] = 4  # and num_envs_per_worker=1
config["framework"] = "tf"         # same behavior with "torch"
config["horizon"] = 200            # cap episode length at 200 steps

trainer = sac.SACTrainer(config=config, env="Pendulum-v0")

# Train and watch memory usage grow over iterations.
for i in range(1000):
    result = trainer.train()
    print(pretty_print(result))

If we cannot run your script, we cannot fix your issue.

ericl commented 4 years ago

I could be wrong, but isn't this simply showing the replay buffer increasing in size? The default is 1 million timesteps.
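
One way to check is to cap the buffer and see whether memory plateaus once it is full (a minimal sketch, assuming the buffer_size key in SAC's DEFAULT_CONFIG for this Ray version):

import ray
import ray.rllib.agents.sac as sac

ray.init()
config = sac.DEFAULT_CONFIG.copy()
# Cap the replay buffer at 50k transitions instead of the default 1M;
# once 50k transitions are stored, the buffer itself stops growing.
config["buffer_size"] = 50_000
config["framework"] = "tf"

trainer = sac.SACTrainer(config=config, env="Pendulum-v0")
for _ in range(100):
    trainer.train()  # memory usage should flatten out after the buffer fills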

rstrudel commented 4 years ago

Thanks for your answer @ericl! Indeed, it seems I misunderstood the replay buffer logic. I thought the whole replay buffer memory was allocated statically at initialization, but it is actually allocated dynamically, which would explain the memory growth.

I got out-of-memory errors after training overnight on environments that work well with rlkit, which is why I suspected a memory leak. Maybe it comes from the fact that a replay buffer is created for each parallel worker (which could then explain running out of memory once you use too many workers), but I am not sure about the RLlib logic; I need to read the codebase more thoroughly. I will close this issue in the meantime and do more experiments.
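
To put the dynamic growth into perspective, here is a rough, illustrative estimate of the raw storage a full buffer would need for Pendulum-v0. RLlib's SampleBatch-based buffers carry additional columns and Python overhead, so real usage is higher; the per-worker multiplier below is only the speculation from the paragraph above:

# Back-of-the-envelope estimate (illustrative only): raw float32 storage for
# Pendulum-v0 transitions (obs, next_obs, action, reward, done).
obs_dim, act_dim = 3, 1
bytes_per_transition = 4 * (obs_dim + obs_dim + act_dim + 1 + 1)  # 36 bytes
buffer_size = 1_000_000
num_buffers = 4  # hypothetical: one buffer per rollout worker, if replay is per-worker
print(bytes_per_transition * buffer_size * num_buffers / 1e6, "MB raw")  # ~144 MB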

billyzs commented 3 years ago

I limited the replay buffer size to 1000 and still saw unbounded memory usage during SAC training. My setup: Ubuntu 18.04, Ray 1.0.1.post1, TensorFlow 2.3.0, Python 3.8.6.

[Screenshot 2021-01-14-152457: memory usage during training]

htop shows memory usage as:

 3056 bbb      20   0 62.7G  17.5G  201M S  0.0 13.9  0:00.00 ray::SAC.train()

How much memory usage is considered reasonable for training SAC?
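
One way to answer that empirically is to log the training process's resident memory each iteration, for example with psutil (a sketch, not from the original thread):

import psutil
import ray
import ray.rllib.agents.sac as sac

ray.init()
config = sac.DEFAULT_CONFIG.copy()
config["buffer_size"] = 1000  # small buffer, as in the report above
trainer = sac.SACTrainer(config=config, env="Pendulum-v0")

proc = psutil.Process()  # current (driver) process; Ray worker processes can be checked the same way
for i in range(100):
    trainer.train()
    print(f"iter {i}: RSS = {proc.memory_info().rss / 1e9:.2f} GB")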

sven1977 commented 3 years ago

@billyzs just out of curiosity: What's the size of your observation and action spaces?

billyzs commented 3 years ago

Observation space: 5x1 ints. Action space: 2x1 floats.

billyzs commented 3 years ago

I should also mention that I was using the offline dataset API (via a custom but stateless InputReader). Perhaps it's something particular to this code path?
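
For context, a stateless reader along these lines would subclass RLlib's InputReader and return a fresh batch from next(). This is an illustrative sketch, not the actual reader used here; the column names follow the standard SampleBatch format, and the shapes match the spaces reported above:

import numpy as np
from ray.rllib.offline import InputReader
from ray.rllib.policy.sample_batch import SampleBatch

class RandomInputReader(InputReader):
    """Hypothetical stateless reader that fabricates random transitions."""

    def next(self):
        n = 32  # transitions returned per call
        return SampleBatch({
            "obs": np.random.randint(0, 10, size=(n, 5)).astype(np.float32),
            "new_obs": np.random.randint(0, 10, size=(n, 5)).astype(np.float32),
            "actions": np.random.uniform(-1.0, 1.0, size=(n, 2)).astype(np.float32),
            "rewards": np.random.randn(n).astype(np.float32),
            "dones": np.zeros(n, dtype=bool),
        })

# Plugged in via the "input" config key, e.g.:
# config["input"] = lambda ioctx: RandomInputReader()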

sven1977 commented 3 years ago

From our conversation on Slack:

Could you give me the observation and action spaces? I want to try reproducing this with a corresponding random env, just to see whether I see the same thing with a buffer of 1000. Trying to reproduce this now …
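
Such a random env can be as simple as a gym.Env that emits random observations with the right spaces. A hypothetical sketch, registered under a made-up name:

import gym
import numpy as np
from gym.spaces import Box
from ray.tune.registry import register_env

class RandomObsEnv(gym.Env):
    """Hypothetical env: random observations, ignores actions."""

    def __init__(self, config=None):
        self.observation_space = Box(-10.0, 10.0, shape=(5,), dtype=np.float32)
        self.action_space = Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        self._t = 0

    def reset(self):
        self._t = 0
        return self.observation_space.sample()

    def step(self, action):
        self._t += 1
        done = self._t >= 200  # fixed episode length
        return self.observation_space.sample(), 0.0, done, {}

register_env("random_obs_env", lambda cfg: RandomObsEnv(cfg))
# Then: trainer = sac.SACTrainer(config=config, env="random_obs_env")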

I’m not seeing SAC leaking on an Atari env (observation space float32 84x84x3), even with a larger buffer (1M).

Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 12.1/16.0 GiB
Memory usage on this node: 12.1/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.8/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.7/16.0 GiB
Memory usage on this node: 11.8/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 11.9/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.4/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.7/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 11.6/16.0 GiB
Memory usage on this node: 10.7/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 10.0/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.4/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.7/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 10.0/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.7/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.7/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 9.8/16.0 GiB
Memory usage on this node: 9.8/16.0 GiB
Memory usage on this node: 9.9/16.0 GiB
Memory usage on this node: 9.9/16.0 GiB
Memory usage on this node: 9.9/16.0 GiB
Memory usage on this node: 10.0/16.0 GiB
Memory usage on this node: 10.0/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 9.6/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 9.8/16.0 GiB
Memory usage on this node: 9.8/16.0 GiB
Memory usage on this node: 10.0/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.7/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 9.6/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 10.0/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 9.8/16.0 GiB
Memory usage on this node: 9.5/16.0 GiB
Memory usage on this node: 9.5/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 9.4/16.0 GiB
Memory usage on this node: 9.6/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 9.8/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 9.8/16.0 GiB
Memory usage on this node: 10.0/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.1/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 11.9/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 11.4/16.0 GiB
Memory usage on this node: 11.4/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.4/16.0 GiB
Memory usage on this node: 11.6/16.0 GiB
Memory usage on this node: 11.7/16.0 GiB
Memory usage on this node: 11.7/16.0 GiB
Memory usage on this node: 11.8/16.0 GiB
Memory usage on this node: 11.8/16.0 GiB
Memory usage on this node: 11.9/16.0 GiB
Memory usage on this node: 12.0/16.0 GiB
Memory usage on this node: 12.0/16.0 GiB
Memory usage on this node: 12.3/16.0 GiB
Memory usage on this node: 12.4/16.0 GiB
Memory usage on this node: 12.3/16.0 GiB
Memory usage on this node: 12.4/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 8.6/16.0 GiB
Memory usage on this node: 8.7/16.0 GiB
Memory usage on this node: 8.8/16.0 GiB
Memory usage on this node: 9.0/16.0 GiB
Memory usage on this node: 9.0/16.0 GiB
Memory usage on this node: 9.3/16.0 GiB
Memory usage on this node: 9.4/16.0 GiB
Memory usage on this node: 9.6/16.0 GiB
Memory usage on this node: 9.8/16.0 GiB
Memory usage on this node: 9.9/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.1/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.2/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 11.6/16.0 GiB
Memory usage on this node: 11.6/16.0 GiB
Memory usage on this node: 11.7/16.0 GiB
Memory usage on this node: 11.8/16.0 GiB
Memory usage on this node: 11.8/16.0 GiB
Memory usage on this node: 11.8/16.0 GiB
Memory usage on this node: 9.3/16.0 GiB
Memory usage on this node: 9.6/16.0 GiB
Memory usage on this node: 10.2/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.0/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.3/16.0 GiB
Memory usage on this node: 11.4/16.0 GiB
Memory usage on this node: 9.7/16.0 GiB
Memory usage on this node: 10.0/16.0 GiB
Memory usage on this node: 10.7/16.0 GiB
Memory usage on this node: 10.8/16.0 GiB
Memory usage on this node: 10.9/16.0 GiB
Memory usage on this node: 11.5/16.0 GiB
Memory usage on this node: 10.3/16.0 GiB
Memory usage on this node: 10.4/16.0 GiB
Memory usage on this node: 10.5/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
Memory usage on this node: 10.6/16.0 GiB
+----------------------------------------+------------+-------+--------+------------------+--------+----------+----------------------+----------------------+--------------------+
| Trial name                             | status     | loc   |   iter |   total time (s) |     ts |   reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|----------------------------------------+------------+-------+--------+------------------+--------+----------+----------------------+----------------------+--------------------|
| SAC_MsPacmanNoFrameskip-v4_f8a38_00000 | TERMINATED |       |   1320 |          1396.81 | 100000 |      462 |                  880 |                  320 |               2185 |
+----------------------------------------+------------+-------+--------+------------------+--------+----------+----------------------+----------------------+--------------------+

^^ 100k timesteps.

... I can try it with InputReader as well. …

....

I’m running the same experiment with the input key set to a file. I'm not seeing the leaking when using a smaller file (~64 MB). What are your file sizes, roughly?

Also with 10 files of 64 MB each (input = a directory with 10 JSON files, 64 MB each), I’m not seeing any leaking. Running it for 100k ts.

I did switch off input evaluation (input_evaluation=[]), though. I do remember that in these offline estimators, we store estimation results until metrics are requested. Maybe that’s where it’s exploding on your end. Can you try switching it off as well (or would this break other things)? It’s only used for reporting purposes.
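
For reference, that combination looks roughly like this in the config (the data path is a placeholder):

config["input"] = "/path/to/offline_json_dir"  # or a callable returning a custom InputReader
config["input_evaluation"] = []                # disable the offline estimators (reporting only)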

sven1977 commented 3 years ago

@billyzs ^