opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

Training process gets killed due to OOM #82

Closed by aceofgreens 1 year ago

aceofgreens commented 1 year ago

Summary of issue

The training process gets killed by the kernel. A log entry in dmesg gives the reason as "out of memory".

Model: MuZero with self-supervision
Environment: Pong
Architecture: exactly the same as the default one for Atari envs except that:

The process gets killed after 40k training iterations (a bit more than 500k environment steps). The Buffer/memory_usage/process log shows that the total memory used starts at 0 and grows slightly faster than linearly to 6e+4, at which point the process is killed.

NOTE: I have been able to reproduce the "Quick Start" training run on Pong with the default config. No issue there.

General questions:

  1. Why does the memory used by the process seem to always increase? Is it the replay buffer?
  2. Is there a way to control the memory used from any of the config settings, so that the process does not get killed?

puyuan1996 commented 1 year ago

Hello,

Thank you for your attention and inquiry.

  1. The steady increase in process memory is primarily due to collected data being continuously added to the replay buffer, so the buffer's memory footprint keeps growing until it reaches its capacity. A rough estimate of the full buffer is 12*96*96/1024/1024/1024*1e6 ≈ 103GB. Here, 12*96*96 is the number of values in one stored observation (12 channels of 96×96 pixels, assuming one byte per value), and 1e6 is the default capacity/size of the replay buffer.

  2. You can indeed control memory usage by modifying the configuration settings.

    • Firstly, we recommend using grayscale images, which reduces memory usage by a factor of 3. Previous experiments have shown that this change has almost no negative effect on performance.
    • Secondly, you may consider reducing the size of the replay buffer. Note, however, that this may slightly decrease the performance of the algorithm; the specific impact depends on your environment and algorithm settings.
    • Furthermore, you can consider using a more efficient data storage format, such as converting images into strings for storage. You can add "transform2string=True, gray_scale=True," to the policy field in the configuration. Note, however, that this feature is still under development, and contributions are very welcome.
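The suggestions above could be sketched as a set of overrides for the policy field. This is only an illustrative fragment, not a complete config: `gray_scale` and `transform2string` are the flags quoted above, while the key name `replay_buffer_size`, its value, and the variable name `policy_overrides` are assumptions, so check them against the config file you are actually using.

```python
# Hypothetical override fragment: merge these entries into the `policy`
# field of your existing config rather than using this dict on its own.
policy_overrides = dict(
    # Store grayscale frames (about 3x less memory, per the reply above).
    gray_scale=True,
    # Store observations as strings (feature described as under development).
    transform2string=True,
    # Assumed key name and an illustrative value below the 1e6 default.
    replay_buffer_size=int(2e5),
)

print(policy_overrides)
```

A smaller buffer trades some performance for memory, so the value above is a placeholder to tune rather than a recommendation.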

If the above methods cannot solve the problem, you might need to consider increasing the memory capacity of your system.
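To judge how much memory a given setting needs, the rough estimate from point 1 can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `buffer_size_gib` is made up for illustration, it assumes one byte per stored value, and it counts only raw observations, ignoring any other per-transition data or Python overhead.

```python
def buffer_size_gib(channels, height, width, capacity, bytes_per_value=1):
    """Rough size of the stored observations in GiB (1024**3 bytes)."""
    total_bytes = channels * height * width * bytes_per_value * capacity
    return total_bytes / 1024**3

# Numbers from the estimate above: 12 channels of 96x96, capacity 1e6.
rgb_gib = buffer_size_gib(12, 96, 96, 1_000_000)
# Grayscale cuts memory by a factor of 3, i.e. 12 / 3 = 4 channels.
gray_gib = buffer_size_gib(4, 96, 96, 1_000_000)

print(f"color:     {rgb_gib:.1f} GiB")   # color:     103.0 GiB
print(f"grayscale: {gray_gib:.1f} GiB")  # grayscale: 34.3 GiB
```

Changing `capacity` shows how far the replay buffer would have to shrink to fit into the available RAM.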

Best wishes.