Training process gets killed due to OOM

opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)

Apache License 2.0


Summary of issue

The training process gets killed by the kernel. There is a log in dmesg stating that the reason is "out of memory".

Model: MuZero with self-supervision
Environment: Pong

The architecture is exactly the same as the default one for Atari envs, except that:

- I am using RGB instead of grayscale (so the input to the model is (B, 12, 96, 96) with 4 stacked frames)
- I am using a few additional layers in the representation network

The process gets killed after 40k iteration steps (a bit more than 500k environment steps). The Buffer/memory_usage/process log shows that the total memory used starts from 0 and increases a bit faster than linearly to 6e+4, after which the process is killed.

NOTE: I have been able to reproduce the "Quick Start" training run on Pong with the default config. No issue there.

General questions:

- Why does the memory used by the process seem to always increase? Is it the replay buffer?
- Is there a way to control the memory used from any of the config settings, so that the process does not get killed?

Hello,

Thank you for your attention and inquiry.

The steady increase in process memory comes mainly from the replay buffer: collected data is continuously added to it, so its memory footprint keeps rising. A rough estimate for your setup, assuming each observation value is stored as a single byte: 12*96*96/1024/1024/1024*1e6 ≈ 103 GiB. Here, 1e6 is the default capacity/size of the replay buffer.
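The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `replay_buffer_gib` is made up for illustration, and it assumes one stacked observation is stored per transition at one byte per value (uint8), which may not match how the buffer actually lays out its data.

```python
def replay_buffer_gib(channels, height, width, capacity, bytes_per_value=1):
    """Rough size in GiB of `capacity` stored observations.

    Hypothetical helper for illustration; assumes one full stacked
    observation per transition and ignores all other buffer contents.
    """
    bytes_per_obs = channels * height * width * bytes_per_value
    return bytes_per_obs * capacity / 1024**3


# RGB, 4 stacked frames -> 12 channels; default buffer capacity of 1e6.
rgb_gib = replay_buffer_gib(12, 96, 96, 1_000_000)
# Grayscale, 4 stacked frames -> 4 channels.
gray_gib = replay_buffer_gib(4, 96, 96, 1_000_000)

print(f"RGB:       {rgb_gib:.1f} GiB")   # 103.0 GiB
print(f"Grayscale: {gray_gib:.1f} GiB")  # 34.3 GiB
```

If the observations were held as float32 instead (passing `bytes_per_value=4`), the same estimate would be four times larger.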
You can indeed control memory usage by modifying the configuration settings.
- Firstly, we recommend using grayscale images, which reduces observation memory by a factor of 3 (4 channels instead of 12). Previous experiments have shown that this change has little negative effect on performance.
- Secondly, you may consider reducing the size of the replay buffer. Please note that this may slightly decrease the performance of the algorithm; the specific impact depends on your environment and algorithm settings.
- Furthermore, you can consider using a more efficient data storage format, such as converting images into strings for storage. To do so, add "transform2string=True, gray_scale=True," to the policy field in the configuration. Please note that this feature is currently under development, and contributions are very welcome.
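As a rough sketch, the three suggestions above might look like this in the policy part of a config. The `transform2string` and `gray_scale` fields are the ones named above; the key name `replay_buffer_size` and the value 2e5 are assumptions for illustration and should be checked against the config you are actually using.

```python
# Hypothetical fragment of the policy config, not a complete LightZero config.
policy_overrides = dict(
    # Store grayscale frames: 4 stacked channels instead of 12.
    gray_scale=True,
    # Store images as strings (feature still under development).
    transform2string=True,
    # Assumed key name; illustrative value, smaller than the 1e6 default.
    replay_buffer_size=int(2e5),
)

print(policy_overrides)
```

Switching to grayscale presumably also requires the model's input shape to match the 4-channel observations.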

If the above methods cannot solve the problem, you might need to consider increasing the memory capacity of your system.

Best wishes.

Training process gets killed due to OOM #82