Open mcrowson opened 4 years ago
Same issue. Not only with tune.
How to kill those idle processes? I used Ray to parallelize nested for loops, now tons of IDLE processes are remaining on the remote.
Hack for now: ps aux | grep ray::IDLE | grep -v grep | awk '{print $2}' | xargs kill -9
Yea that is what I've been doing, just seems like the wrong long term solution.
I have the same issue, but it also includes some rollout workers and other stuff, and they are even holding on to part of the RAM. I would like to have a simple, cross-platform way to kill them all.
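For anyone who wants something more portable than the ps/grep pipeline, here is a minimal sketch using the third-party psutil package. It assumes the leftover workers still show the ray::IDLE title in their process name or command line, as in the ps output above; the helper names are made up for illustration, and it only covers the node it runs on. It defaults to a dry run so you can inspect the PIDs first.

```python
from typing import List, Optional, Sequence

# Process title seen in the ps output in this thread (assumption: unchanged in your Ray version).
IDLE_MARKER = "ray::IDLE"


def looks_like_idle_ray_worker(name: Optional[str], cmdline: Optional[Sequence[str]]) -> bool:
    """Return True if the process name or command line carries the ray::IDLE title."""
    in_cmdline = IDLE_MARKER in " ".join(cmdline or [])
    in_name = IDLE_MARKER in (name or "")
    return in_cmdline or in_name


def kill_idle_ray_workers(dry_run: bool = True) -> List[int]:
    """List (and, if dry_run is False, kill) matching processes on this node only."""
    import psutil  # third-party: pip install psutil

    matched = []
    for proc in psutil.process_iter(attrs=["pid", "name", "cmdline"]):
        info = proc.info  # fields may be None when access is denied
        if not looks_like_idle_ray_worker(info["name"], info["cmdline"]):
            continue
        matched.append(info["pid"])
        if not dry_run:
            try:
                proc.kill()
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                # Process already exited or belongs to another user; skip it.
                pass
    return matched


if __name__ == "__main__":
    print("matching PIDs:", kill_idle_ray_workers(dry_run=True))
```

Like the shell hack, this is a workaround rather than a fix, and it will also kill idle workers that belong to a Ray cluster you still want running.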
This has confused me for some days. kill pid can only be applied on one node...
And why is this issue's priority just P3? @rkooo567
Hmm, maybe we can revisit this. It looks like many people seem to have the same issue.
To be clear, did I understand this issue correctly?
Is this correct?
In my case https://github.com/ray-project/ray/issues/22154,
cc @rkooo567 is this still tracked? Can I assign this to you or someone else from core? (it currently shows up as Ray Tune issue which it apparently is not)
You can always do ray stop --force.
@krfricke it is not tracked right now. Do you have a repro btw?
Hi, I have the same problem too. Although the job is finished, the processes are still running, and I don't want to use a script that kills the processes. Do you have any solution?
@muratkoc93 is it possible for you to share the reproduction script? We will fix the issue, and having more repro scripts will help us fix it quickly. If there are IDLE processes not cleaned up after the job is terminated, it is a bug.
I'm not able to reproduce this. Does anyone have a repro script? Else I think we should close.
I tried with VSCode 1.74.2, Ray 2.1.0. I created the following Ray Tune script and entered debugging mode, then killed it with KeyboardInterrupt (cmd+c). I saw the IDLE processes come into existence and then immediately go away. I tried for both local mode and with Ray running on my machine (ray start --head). I also tried killing the top-level process from the VSCode debugger UI.
#!/usr/bin/env python3
from ray import tune

def to_debug(*args):
    import time
    i = 0
    while True:
        print('iter', i)
        time.sleep(1)
        i += 1

tune.run(to_debug)
while true; do echo 'checking..' ; ps aux | grep 'IDL[E]'; sleep 1; done
It looks like we fixed a bug where workloads with Ray Datasets would leak IDLE processes due to a leaked reference to a stats actor. This is fixed in https://github.com/ray-project/ray/issues/22154. It could be that the observed leaking was due to this.
I will close for now; if someone has a repro script, I'm happy to fix it.
cc @muratkoc93 @rkooo567
I still see this issue on the latest ray release (ray-2.3.0) @cadedaniel. I'll try to get a reliable reproduction.
Please reopen the issue once you find the repro!
Same issue on an Ubuntu machine! Any updates on a solution? @mjlbach
@claysmyth do you have a repro? We want to fix this!
@cadedaniel My issue may have been a false alarm actually. My VSCode crashed while running a jupyter notebook utilizing ray remote. I then force-logged out of my account (which usually kills running processes). I think what actually happened is that having ray remotes idle kept the jupyter notebook server running, even after logging out. Once I found and killed the jupyter notebook, the ray remotes were also killed.
However, I'll do some digging and report back if I find anything strange. Thanks!
i see a ton of these when working remotely w/ vscode (in a devcontainer)
@dss010101 can you provide a repro script?
i will try to come up with one...but basically
Has there been any solution to this issue? I still see leaking ray:IDLEs, especially after version 2.11.0
EDIT: after every run of Ray Workflow (~100 workflow tasks) we see additional IDLEs being left behind which are taking up more and more memory.
What is the problem?
When developing locally I sometimes start up a tune.run, and part way through I might kill the process from my IDE (using VSCodium). This disconnects me and the IDE indicates that no debug process is running, however I have extra ray:IDLE processes that are still sucking up RAM.
Reproduction (REQUIRED)
Run that with the debugger from within VSCodium and then click that nice juicy disconnect button up top because you realized your gym env is messed up and stuck in a loop.