Closed Svalorzen closed 3 years ago
Can you reproduce this with a toy env? We can't debug scripts that aren't self contained.
Hi, I got the same error as this issue — any update here?
I think I resolved the problem: for me the episode length was way too high, so Ray was trying to keep a huge number of experiences in memory, which effectively consumed unbounded memory. Not sure if this is the same as what you are seeing.
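For anyone hitting this, a quick back-of-envelope estimate can tell you whether the buffer/episode sizes alone explain the memory use. This is a minimal sketch with made-up numbers (`obs_dim`, `act_dim`, and `buffer_size` below are illustrative assumptions, not values from this thread):

```python
# Rough estimate of replay-buffer memory for one worker.
# All sizes here are assumptions for illustration.
obs_dim = 64             # assumed observation size (float32 values)
act_dim = 8              # assumed action size
buffer_size = 1_000_000  # number of stored transitions

# Each transition stores obs, next_obs, action, reward, done (float32 ~= 4 bytes each).
bytes_per_transition = 4 * (2 * obs_dim + act_dim + 2)
total_gb = buffer_size * bytes_per_transition / 1e9
print(f"~{total_gb:.2f} GB for the buffer alone")  # ~0.55 GB with these numbers
```

If the estimate is far below what you observe, the growth is probably not the buffer itself but episodes held in flight (e.g. very long episodes that must be kept whole for postprocessing).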
Oh OK, mine is 2000 steps per episode. Do you happen to remember your episode length?
I have the same issue using DDPG. I use 50 workers and a replay buffer of size 100000. It is consuming more than 60 GB after 50M iterations, and it has been increasing linearly since the beginning. I'm using release 0.8.6.
I am using SAC:

```python
analysis = tune.run(
    sac.SACTrainer,
    config={
        "env": "RsmAtt",
        "num_gpus": 1,
        "num_workers": 0,
        "use_pytorch": 1,
        "framework": "torch",
        "buffer_size": int(1e4),
        "rollout_fragment_length": 100,
    },
    stop={"training_iteration": 200},
)
```
and memory keeps increasing by ~300 MB each iteration.
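One way to confirm steady per-iteration growth is to log the process's resident memory around each `train()` call. A minimal sketch using only the standard library (the commented loop is hypothetical; `trainer.train()` stands in for the SAC trainer iteration above):

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in KB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Hypothetical training loop to track growth per iteration:
# prev = peak_rss_mb()
# for i in range(200):
#     trainer.train()
#     cur = peak_rss_mb()
#     print(f"iter {i}: peak RSS {cur:.0f} MB (+{cur - prev:.0f} MB)")
#     prev = cur

print(f"current peak RSS: {peak_rss_mb():.0f} MB")
```

A flat delta after the replay buffer fills suggests normal warm-up; a constant positive delta long after the buffer is full points at a genuine leak.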
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Hi again! The issue will be closed because there has been no activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
> I have the same issue using DDPG. I use 50 workers, and replay buffer of size 100000. It is consuming more than 60 GB after 50M iterations, and it is linearly increasing since the beginning. I'm using release 0.8.6.
Did you ever find the cause of the increasing memory usage?
No, I still have this issue...
What is the problem?
I am training QMIX with a custom 6-agent environment, and memory usage just seems to grow indefinitely over time. The problem might be related to #3884, but I am not sure.
The custom environment is a wrapper around a dynamic C++ library built with Boost.Python; I can share it if needed. I have tried to limit Ray's memory in the `init()` call, but it doesn't seem to have any effect. Memory usage grows slowly over time, reaching ~64 GB after about 1 hour.

Ray version: 0.8.0
Python version: 3.7.4
OS: CentOS 7.7.1908 (cluster)
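For reference, a sketch of how the limits can be passed at init time in the Ray 0.8.x API (parameter names changed in later releases, so treat this as a version-specific example, not current API):

```python
import ray

# Ray 0.8.x-era memory caps, set at init time:
ray.init(
    memory=20 * 1024**3,               # cap for Ray worker heap memory
    object_store_memory=10 * 1024**3,  # cap for the plasma object store
)
```

Note that these caps apply to workers and the object store, not to the driver process, which is where the replay buffer typically lives — that may be why the limit appears to have no effect on this growth.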
Reproduction
This is the Python script I am using (the first two imports are the custom libraries). Ray reports, after initialization:
> Starting Ray with 29.79 GiB memory available for workers and up to 18.63 GiB for objects.
But total memory usage exceeds 64 GB, which crashes the training since I don't have more RAM.