@sergeivolodin How high does it go? Also, can you try different TensorFlow versions?
@richardliaw Here it ate all the remaining memory on my laptop and crashed WSL. Which versions of TensorFlow do you want me to try?
We also observed the same issue running QMIX with PyTorch on SMAC. In my case, the machine has 16 GB of RAM, and the training would eventually consume all the RAM and crash (at about 600+ iterations using the 2s3z map).
Tried setting object_store_memory=1*10**9 and redis_max_memory=1*10**9 for ray.init(), but it didn't help.
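For reference, a sketch of how such limits would be passed (the exact call isn't shown in this thread; these ray.init() arguments existed in Ray 0.8.x):

import ray

# Sketch only -- not the original script. In Ray 0.8.x, ray.init() accepted
# these memory caps (in bytes); per the comment above, they did not stop the
# per-worker memory growth.
ray.init(
    object_store_memory=1 * 10**9,  # cap the plasma object store at ~1 GB
    redis_max_memory=1 * 10**9,     # cap the Redis shards at ~1 GB
)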
Ray version: 0.8.7
OS: Ubuntu 16.04
PyTorch version: 1.3.0
Command used: python run_qmix.py --num-iters=1000 --num-workers=7 --map-name=2s3z (using the example from the SMAC repository)
Error message:
Failure # 1 (occurred at 2020-08-20_13-12-55)
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/ubuntu/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/home/ubuntu/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/worker.py", line 1538, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::QMIX.train() (pid=949, ip=10.11.0.13)
File "python/ray/_raylet.pyx", line 440, in ray._raylet.execute_task
File "/home/ubuntu/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/memory_monitor.py", line 128, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node rltb-1 is used (14.9 / 15.67 GB). The top 10 memory consumers are:
PID MEM COMMAND
949 6.25GiB ray::QMIX
4628 0.5GiB /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 24827 -dataDir /home/ubu
4631 0.5GiB /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 24013 -dataDir /home/ubu
4637 0.5GiB /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 20642 -dataDir /home/ubu
4625 0.5GiB /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 20480 -dataDir /home/ubu
4624 0.5GiB /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 19296 -dataDir /home/ubu
4626 0.5GiB /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 21827 -dataDir /home/ubu
4627 0.5GiB /home/ubuntu/StarCraftII/Versions/Base69232/SC2_x64 -listen 127.0.0.1 -port 17905 -dataDir /home/ubu
@jyericlin as a workaround, we just create a separate process for every training step, which opens a checkpoint, does the iteration, and then saves a new checkpoint (pseudocode below, fully functional example here). The extra time spent on creating the process and checkpointing does not seem too bad (for our case!)
while True:
    config_filename = pickle_config(config)
    # starts a new Python process from bash
    # (important: can't just fork, because of tensorflow+fork import issues)
    start_process_and_wait(target=train, config_filename=config_filename)
    results = unpickle_results(config)
    delete_temporary_files()
    if results['iteration'] > N:
        break
    config['checkpoint'] = results['checkpoint']

def train(config_filename):
    config = unpickle_config(config_filename)
    trainer = PPO(config)
    trainer.restore(config['checkpoint'])
    results = trainer.train()
    checkpoint = trainer.save()
    results['checkpoint'] = checkpoint
    pickle_results(results)
    # important -- otherwise memory will go up!
    trainer.stop()
Note that in train we also need to reconnect to the existing Ray instance to reuse worker processes.
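A minimal sketch of that reconnection (assuming the parent process has already started Ray; address="auto" attaches to the running instance instead of creating a new one):

import ray

# Inside the subprocess's train(): attach to the Ray instance the parent
# already started, so the existing worker processes are reused.
if not ray.is_initialized():
    ray.init(address="auto", ignore_reinit_error=True)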
@richardliaw any updates? Do you want me to try different tf versions? Thank you.
Hmm, could you try the latest Ray wheels (latest snapshot of master) to see if this was fixed?
@richardliaw same thing
@richardliaw is there a way to attach a Python debugger to a ray worker, preferably from an IDE like PyCharm? I could take a look at where and why the memory is being used
There's now a Ray PDB tool: https://docs.ray.io/en/master/ray-debugging.html?highlight=debugger
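A minimal sketch of how that debugger is used, following the linked docs (it is pdb-based in a terminal rather than IDE-based): set a breakpoint inside the remote code, then attach with the ray CLI.

import ray

ray.init()

@ray.remote
def suspicious_task():
    data = [0] * 1000
    breakpoint()  # registers a breakpoint with the Ray debugger
    return sum(data)

ray.get(suspicious_task.remote())
# Then, in another terminal on the same machine, run:
#   ray debug
# and select the active breakpoint to step through the worker with pdb.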
Is there any update on this issue? I have been dealing with the same problem.
Hi @sergeivolodin, how are you plotting these graphs, please? I suspect I'm having a similar issue.
@azzeddineCH using this custom script:
https://github.com/HumanCompatibleAI/better-adversarial-defenses/tree/master/other/memory_profile
$ pip install psutil numpy matplotlib humanize inquirer
$ python mem_profile.py
writes memory data for all processes of your user to a file mem_out_{username}_{time_start}.txt (it will print the filename).
$ python mem_analyze.py
plots the data. For this one there are options:
--input FILENAME: which file to open; if omitted, the script will ask you (navigate up/down, use Enter to select)
--customize: select the processes to plot interactively (navigate up/down, SPACE bar to select, Enter to finish selecting)
--track PROCESS_NAME
--subtract
--max_lines: only reads that many lines from the file (useful if the first script was running for more time than the monitored program itself)
Good luck!
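For anyone who prefers not to use those scripts, a rough stand-in (not the linked code itself): sample the RSS of every process owned by the current user with psutil and plot it with matplotlib.

# Rough stand-in for the linked scripts, not the same code: sample per-process
# RSS once per second for a while, then plot memory over time per PID.
import getpass
import time

import matplotlib.pyplot as plt
import psutil

user = getpass.getuser()
samples = {}  # pid -> list of (seconds since start, rss in bytes)
start = time.time()
while time.time() - start < 60:  # sample for one minute; adjust as needed
    for proc in psutil.process_iter(["pid", "username", "memory_info"]):
        if proc.info["username"] == user and proc.info["memory_info"]:
            samples.setdefault(proc.info["pid"], []).append(
                (time.time() - start, proc.info["memory_info"].rss))
    time.sleep(1)

for pid, points in samples.items():
    t, rss = zip(*points)
    plt.plot(t, [r / 2**20 for r in rss])
plt.xlabel("time, s")
plt.ylabel("RSS, MiB")
plt.show()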
I think I'm running into this too. Seems like there are "jumps" in memory at regular intervals for me. The leak is very slow for me, and I'm pretty sure I've ruled out everything else it could be but a bug in the multi-agent implementation in RLlib.
This is with the latest wheels, Python 3.8, PyTorch 1.8.0.
I've done some more digging using tracemalloc, and I don't think the bug is actually in the Python code, as there are no large consistent allocations of Python objects. This leaves it as a potential issue with PyTorch or some C++ library, or something to do with how Ray handles worker memory.
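For reference, the kind of tracemalloc comparison described above looks roughly like this (a sketch; run_some_training_iterations is a hypothetical stand-in for the actual trainer.train() calls):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation
before = tracemalloc.take_snapshot()

run_some_training_iterations()  # hypothetical placeholder for the real training loop

after = tracemalloc.take_snapshot()
# Print the ten call sites whose Python allocations grew the most. If nothing
# large shows up here, the growth is likely in native (C/C++) allocations
# that tracemalloc cannot see.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)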
The bug also does not seem to happen consistently for me across machines. More specifically, on servers where cgroups are enabled I get the memory leak, but on my local machine I do not. @sven1977 any ideas? I added a bit more information here: https://discuss.ray.io/t/help-debugging-a-memory-leak-in-rllib/2100
What is the problem?
When training in a multi-agent environment using multiple environment workers, the memory of the workers increases constantly and is not released after the policy updates.
If no memory limit is set, the processes run out of system memory and are killed. If the memory_per_worker limit is set, they go past the limit and are killed.

Ray version and other system information (Python version, TensorFlow version, OS):
ray==0.8.6
Python 3.8.5 (default, Aug 5 2020, 08:36:46)
tensorflow==2.3.0
Ubuntu 18.04.4 LTS (GNU/Linux 4.19.121-microsoft-standard x86_64)
(same thing in non-WSL as well)

Reproduction
Run this and measure memory consumption. If you remove the memory_per_worker limits, it will take longer, as workers will try to consume all system memory.
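The repro script itself isn't included above; as a rough illustration of the kind of setup described (a multi-agent env, several workers, per-worker memory limits on Ray 0.8.x), it might look something like this. The env and numbers below are made up for the sketch; only the config keys match the description.

import gym
import ray
from ray import tune
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoAgentCartPole(MultiAgentEnv):
    """Two independent CartPole copies exposed as a multi-agent env (stand-in)."""

    def __init__(self, env_config=None):
        self.envs = {f"agent_{i}": gym.make("CartPole-v0") for i in range(2)}
        self.observation_space = self.envs["agent_0"].observation_space
        self.action_space = self.envs["agent_0"].action_space

    def reset(self):
        return {aid: env.reset() for aid, env in self.envs.items()}

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}
        for aid, action in action_dict.items():
            obs[aid], rew[aid], done[aid], info[aid] = self.envs[aid].step(action)
        done["__all__"] = all(done.values())
        return obs, rew, done, info


if __name__ == "__main__":
    ray.init()
    tune.run(
        "PPO",
        stop={"training_iteration": 1000},
        config={
            "env": TwoAgentCartPole,
            "num_workers": 4,
            # per-worker memory limits (Ray 0.8.x config keys, in bytes)
            "memory_per_worker": 1 * 10**9,
            "object_store_memory_per_worker": 200 * 1024 * 1024,
        },
    )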