Closed rstrudel closed 1 year ago
I could be wrong, but isn't this simply showing the replay buffer increasing in size? The default is 1 million timesteps.
Thanks for your answer @ericl! Indeed, it seems I misunderstood the replay buffer logic. I thought the whole replay buffer was allocated statically at initialization, but it is in fact filled dynamically, which would explain the memory growth.
I got out-of-memory errors after training overnight on environments that work well with rlkit, so I suspected a memory leak. Maybe it comes from the fact that a replay buffer is created for each parallel worker (which would then explain running out of memory once you use too many workers), but I am not sure about the rllib logic; I need to read the codebase more thoroughly. I will close this issue in the meantime and run more experiments.
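The grow-until-capacity behavior can be illustrated with a minimal ring-buffer sketch (hypothetical code, not RLlib's actual implementation): resident memory grows linearly while the buffer fills, then plateaus once old samples start being overwritten.

```python
class RingReplayBuffer:
    """Minimal replay buffer sketch: grows until capacity, then overwrites."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []   # grows dynamically, up to `capacity` entries
        self.next_idx = 0   # position to overwrite once full

    def add(self, sample):
        if len(self.storage) < self.capacity:
            self.storage.append(sample)           # memory still growing here
        else:
            self.storage[self.next_idx] = sample  # steady state: no growth
        self.next_idx = (self.next_idx + 1) % self.capacity


buf = RingReplayBuffer(capacity=1000)
for t in range(5000):
    buf.add(t)
print(len(buf.storage))  # stays at 1000 after the buffer fills
```

With a 1M-timestep default capacity, the same pattern means memory can keep climbing for a long time before it levels off, which looks like a leak early in training.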
I limited the replay buffer size to 1000 and still saw unbounded memory usage during SAC training. My setup: Ubuntu 18.04, Ray 1.0.1.post1, TensorFlow 2.3.0, Python 3.8.6. htop shows memory usage as:
3056 bbb 20 0 62.7G **17.5G** 201M S 0.0 13.9 0:00.00 ray::SAC.train()
How much memory usage is considered reasonable for training SAC?
@billyzs just out of curiosity: What's the size of your observation and action spaces?
Observation space: 5 by 1 ints. Action space: 2 by 1 floats.
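For spaces that small, a back-of-envelope estimate suggests the buffer itself should be tiny. A hedged sketch (assuming each transition stores obs, action, next obs, reward, and done as 4-byte values; real buffers add per-entry Python object overhead on top of this):

```python
def replay_buffer_bytes(obs_elems, act_elems, capacity, dtype_bytes=4):
    """Rough raw-data footprint: obs + next_obs + action + reward + done."""
    per_transition = (2 * obs_elems + act_elems + 2) * dtype_bytes
    return per_transition * capacity


# 5x1 int observation, 2x1 float action, buffer of 1000 transitions
size = replay_buffer_bytes(obs_elems=5, act_elems=2, capacity=1000)
print(f"{size} bytes (~{size / 1024:.1f} KiB)")  # 56000 bytes (~54.7 KiB)
```

So a 1000-entry buffer for these spaces is on the order of tens of KiB of raw data; the 17.5 GB resident size in the htop output cannot plausibly be the buffer contents themselves.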
I should also mention that I was using the offline dataset API (via a custom but stateless InputReader). Perhaps it's something particular to this code path?
From our conversation on Slack:
Could you give me the observation and action spaces? I want to try reproducing this with a corresponding random env, just to see whether I see the same thing with a buffer of 1000. Trying to reproduce this now …
I’m not seeing SAC leaking on an Atari env (observation space float32 84x84x3), even with a larger buffer (1M).
Memory usage on this node: 10.1/16.0 GiB
[… ~260 similar status lines elided: usage fluctuates between 8.6 and 12.4 GiB over the run, with no sustained upward trend …]
Memory usage on this node: 10.6/16.0 GiB
+----------------------------------------+------------+-------+--------+------------------+--------+----------+----------------------+----------------------+--------------------+
| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|----------------------------------------+------------+-------+--------+------------------+--------+----------+----------------------+----------------------+--------------------|
| SAC_MsPacmanNoFrameskip-v4_f8a38_00000 | TERMINATED | | 1320 | 1396.81 | 100000 | 462 | 880 | 320 | 2185 |
+----------------------------------------+------------+-------+--------+------------------+--------+----------+----------------------+----------------------+--------------------+
^^ 100k timesteps.
… I can try it with InputReader as well.
I’m running the same experiment with the input key set to a file. Not seeing the leak when using a smaller file (~64 MB). What are your file sizes, roughly?
Also with 10 files of 64 MB each (input = [directory with 10 JSON files, 64 MB each]), I’m not seeing any leaking. Running it for 100k ts.
I did switch off input evaluation (input_evaluation=[]), though. I remember that these offline estimators store estimation results until metrics are requested. Maybe that’s where it’s exploding on your end. Can you try switching it off as well (or would this break other things)? It’s only used for reporting purposes.
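The failure mode described above — results accumulating until metrics are pulled — can be sketched generically (hypothetical code, not RLlib's actual estimator class):

```python
class OfflineEstimator:
    """Sketch of an estimator that buffers results until metrics are pulled."""

    def __init__(self):
        self.results = []

    def estimate(self, batch):
        # Each call appends; if get_metrics() is never called, this list
        # grows without bound -- the suspected leak pattern.
        self.results.append(sum(batch) / len(batch))

    def get_metrics(self):
        metrics = {"mean_estimate": sum(self.results) / len(self.results)}
        self.results.clear()  # flushing here is what bounds the memory
        return metrics


est = OfflineEstimator()
for _ in range(3):
    est.estimate([1.0, 2.0, 3.0])
print(est.get_metrics())   # {'mean_estimate': 2.0}
print(len(est.results))    # 0 -- buffer flushed
```

If the metrics-collection path is never exercised on the offline-data code path, a buffer like this would grow for the whole run, which matches the reported symptoms.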
@billyzs ^
What is the problem?
When running SAC on Pendulum-v0 with a GPU and a single worker or multiple workers, rllib suffers from a memory leak, independently of the framework used (tf or torch). The attached log shows the linear growth of memory usage during training (pink and cyan curves). I also ran PPO on Pendulum-v0 with both frameworks and did not observe any memory leak, as the log shows (red and blue curves); the memory usage is constant.
The reported logs are with num_workers=4 and num_envs_per_worker=4; I also observe the leak when setting num_workers=0 and num_envs_per_worker=1.
Ray version and other system information (Python version, TensorFlow version, OS): ray: 0.9.0.dev0, python: 3.7, pytorch: 1.6.0, tensorflow: 2.3.0, os: ubuntu 18.04
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.