ZaberKo opened this issue 2 years ago (status: Open)
thanks for the report. we are aware of a potential memory leak in the framework and will update when we understand the problem better. one bandaid: we are introducing recovery functionality after a worker fails, which should bring the crashed worker back automatically.
Update: After testing different config parameters, I found that when I stop using evaluation (by setting `evaluation_interval` to a very large number), no OOM happens anymore (tested over 100000 iterations). So, is there something wrong with the `sync_weights` call between `workers` and `evaluation_workers`? I am aware that the code there is a little tricky: it overwrites non-writable numpy arrays with pytorch tensors during syncing.
Framework: pytorch 1.11.0
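For reference, this is roughly the relevant part of my config (a minimal sketch in the Ray 1.11 dict-style config; the env name and `num_workers` are just illustrative values, only the `evaluation_*` keys matter here):

```python
# Minimal sketch: the evaluation-related keys I toggled.
# Setting evaluation_interval to a huge number effectively disables evaluation,
# which is the setup that no longer OOMs in my tests.
config = {
    "env": "BreakoutNoFrameskip-v4",    # illustrative
    "framework": "torch",
    "num_workers": 8,                   # illustrative
    "evaluation_num_workers": 1,
    # "evaluation_interval": 10,        # periodic evaluation -> OOM after many iterations
    "evaluation_interval": 1000000000,  # effectively never evaluate -> no OOM in 100000 iterations
}
```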
By watching the memory usage in netdata and the logs in the rllib output, I find that the `_queue.Empty` error happens first, and only then does the memory start blowing up. @gjoliver So I guess some code raises an error that kills a worker (which could in turn cause the `_queue.Empty` error), and that then results in the OOM.
As for auto-recovery, it seems to work at the `Trainer` level and to rely on the `ray.tune.run()` API. In my case, I want to implement my own Trainer class and call it manually (with some special design), and it is hard to save all the state needed for recovery.
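For context, by "manually" I mean roughly the following kind of loop (a minimal sketch, assuming the Ray 1.x `ImpalaTrainer` API; the checkpoint path and loop bounds are just illustrative):

```python
import ray
from ray.rllib.agents.impala import ImpalaTrainer

ray.init()

# Build and drive the trainer directly instead of going through ray.tune.run(),
# so Tune's trial-level fault tolerance is not available here.
trainer = ImpalaTrainer(config={
    "env": "BreakoutNoFrameskip-v4",  # illustrative
    "framework": "torch",
})

for i in range(10000):
    result = trainer.train()
    if i % 100 == 0:
        # trainer.save() only covers the trainer's own state; any extra
        # custom state would have to be saved and restored separately.
        path = trainer.save("/tmp/impala_ckpt")  # illustrative path
        print(i, result["episode_reward_mean"], path)
```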
that's interesting. does the old reproduce script still work? any chance you can help update the repro script, and we'd really love to take a look. thanks for all the debugging so far.
@gjoliver Yes, the script still works.
After further investigation, I finally found the source of the bug. It seems that the trained policy can keep an episode running for a very long time, i.e., a huge number of timesteps per episode (e.g., 98490 timesteps in Breakout). For example, in Breakout the agent sometimes gets stuck like this: 😂
This eventually exhausts the memory of the `sampler` in the evaluation workers, which temporarily stores the whole current episode. It also explains why the OOM does not happen when I disable evaluation.
Some ideas for a solution:

- Limit episode length by setting `"horizon": 5000` in the config (see the sketch after this list).
- Reduce the memory usage of the `sampler` in the evaluation workers. For example, add an option to collect only the reward instead of everything. In addition, data efficiency in the sampler should be improved; currently the memory cost is several times the real observation data size. (E.g., the real size of the observations over 98490 timesteps is only around 10.3GB.)
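A sketch of the first idea above (the `horizon` key is the standard RLlib episode-length limit; the other values are illustrative):

```python
# Capping episode length bounds how much per-episode data the sampler
# in the evaluation workers has to hold in memory at once.
config = {
    "env": "BreakoutNoFrameskip-v4",  # illustrative
    "framework": "torch",
    "horizon": 5000,                  # hard-stop episodes after 5000 timesteps
    "evaluation_interval": 10,        # illustrative
    "evaluation_num_workers": 1,
}
```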
why not just limit the horizon of the episodes to a reasonable length?
Yeah, I think that is enough for now, although it is still not suitable for machines with little memory (e.g., 32GB is not enough even for horizon=3000). With the original horizon (400,000), a machine with 512GB of memory still ran into an OOM at the evaluation stage. But the real data size is actually not that big (as I mentioned above).
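As a rough sanity check of that "real data size" number (assuming raw 210x160x3 uint8 Breakout frames; the figures are approximate):

```python
timesteps = 98490
frame_bytes = 210 * 160 * 3           # one raw RGB Breakout frame, uint8
total_bytes = timesteps * frame_bytes
print(f"{total_bytes / 1e9:.1f} GB")  # ~9.9 GB, in the same ballpark as the ~10.3 GB above
```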
Besides, I think an additional tip about setting a reasonable horizon should be added to the documentation.
I see. completely agree! thanks for all the detective work :)
Any new updates for solving this issue?
What happened + What you expected to happen
When I train IMPALA, the program runs out of memory after a very long period of timesteps (after about 10000 iterations). The first error is the `_queue.Empty` error described above; then the memory keeps increasing until the process is killed by OOM (confirmed in the netdata chart and `dmesg`). I also tested on two different machines, and both of them hit the OOM. I think the issue is around `MultiGPULearnerThread`.

Versions / Dependencies
Python: 3.9.10
Ray: tested on 1.10.0 & 1.11.0
OS: Ubuntu 20.04
3rd Lib: None
Reproduction script
The issue happens with both `tune.run()` and manual training.
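The original script is not reproduced here; the setup described above looks roughly like this (a sketch only, assuming the built-in "IMPALA" trainable; env, worker counts, GPU count, and stop criteria are illustrative):

```python
import ray
from ray import tune

ray.init()

tune.run(
    "IMPALA",
    config={
        "env": "BreakoutNoFrameskip-v4",  # illustrative
        "framework": "torch",
        "num_workers": 8,                 # illustrative
        "num_gpus": 1,                    # illustrative
        "evaluation_interval": 10,
        "evaluation_num_workers": 1,
    },
    stop={"training_iteration": 100000},  # illustrative
)
```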