ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

ray::IDLE processes persist if I disconnect and kill master process from IDE #9528

Open mcrowson opened 4 years ago

mcrowson commented 4 years ago

What is the problem?

When developing locally I sometimes start a tune.run, and partway through I might kill the process from my IDE (VSCodium). This disconnects me and the IDE indicates that no debug process is running; however, I am left with extra ray::IDLE processes that are still consuming RAM.

Reproduction (REQUIRED)

from ray import rllib, tune
from ray.rllib.examples.env.random_env import RandomEnv

class DummyEnv(RandomEnv):
    """Environment whose step() never returns, simulating a stuck gym env."""

    def step(self, action):
        while True:
            continue

conf = rllib.agents.ppo.DEFAULT_CONFIG.copy()
conf['env'] = DummyEnv

tune.run(
    rllib.agents.ppo.PPOTrainer,
    config=conf,
)

Run that with the debugger from within VSCodium, then click that nice juicy disconnect button up top because you realized your gym env is messed up and stuck in a loop.

$ ps aux | grep ray:
myuser    23820  1.6  1.5 13424408 510184 pts/3 Sl   16:26   0:04 ray::PPO.train()
myuser    23821  0.5  0.1 8117500 62816 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23822  0.4  0.1 8117500 62892 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23823  0.4  0.1 8117492 62648 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23824  0.5  0.1 8117492 62716 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23825  0.4  0.1 8117492 62432 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23826  0.4  0.1 8117492 62408 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23827  0.4  0.1 8117492 62784 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23828  0.4  0.1 8117492 62780 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23829  0.4  0.1 8117492 62800 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23830  0.4  0.1 8117492 62428 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23831  0.4  0.1 8117492 62888 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23832 98.8  1.4 12063172 474880 pts/3 Rl   16:26   4:10 ray::RolloutWorker.par_iter_next()
myuser    23833  0.4  0.1 8117516 62516 pts/3   Sl   16:26   0:01 ray::IDLE
myuser    23834 98.8  1.4 12063236 474352 pts/3 Rl   16:26   4:09 ray::RolloutWorker.par_iter_next()
myuser    23835  0.4  0.1 8117492 62716 pts/3   Sl   16:26   0:01 ray::IDLE
kivo360 commented 4 years ago

Same issue. Not only with tune.

linlinzhao commented 4 years ago

How can I kill those idle processes? I used Ray to parallelize nested for loops, and now tons of IDLE processes remain on the remote machine.
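
For reference (not from the thread), a minimal sketch of that kind of nested-loop workload with an explicit ray.shutdown(); function names and loop sizes are illustrative:

import ray

ray.init()

@ray.remote
def inner(i, j):
    # stand-in for the real per-iteration work
    return i * j

try:
    # flatten the nested loops into one batch of remote tasks
    results = ray.get([inner.remote(i, j)
                       for i in range(10) for j in range(10)])
finally:
    ray.shutdown()  # tear down this driver's workers even if the loop raises

Note this only helps when the driver exits on its own; a hard kill from the IDE (the case reported above) never reaches the finally block, which is exactly when the IDLE processes get orphaned.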

KaleabTessera commented 4 years ago

A hack for now: ps aux | grep ray::IDLE | grep -v grep | awk '{print $2}' | xargs kill -9

mcrowson commented 4 years ago

Yeah, that is what I've been doing; it just seems like the wrong long-term solution.

duburcqa commented 3 years ago

I have the same issue, but with some rollout workers and other processes included, and they even hold on to part of the RAM. I would like a simple, cross-platform way to kill them all.
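
A cross-platform variant of the grep/awk hack above, sketched with the third-party psutil package (not from the thread; assumes pip install psutil and that the worker process titles contain "ray::IDLE", as in the ps output above):

import psutil

# Kill every process whose name or command line mentions ray::IDLE.
for proc in psutil.process_iter(["name", "cmdline"]):
    try:
        name = proc.info["name"] or ""
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ray::IDLE" in name or "ray::IDLE" in cmdline:
            proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass  # raced with process exit, or owned by another user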

DonYum commented 3 years ago

This has confused me for some days. kill <pid> can only be applied on one node... And why is this issue's priority only P3? @rkooo567

rkooo567 commented 2 years ago

Hmm, maybe we can revisit this. It looks like many people are hitting the same issue.

To be clear, did I understand the issue correctly?

rokrokss commented 2 years ago

In my case https://github.com/ray-project/ray/issues/22154,

krfricke commented 2 years ago

cc @rkooo567 is this still tracked? Can I assign this to you or someone else from core? (It currently shows up as a Ray Tune issue, which it apparently is not.)

mimoralea commented 2 years ago

You can always do ray stop --force.

rkooo567 commented 2 years ago

@krfricke it is not tracked right now. Do you have a repro btw?

muratkoc93 commented 1 year ago

Hi, I have the same problem too. Although the job is finished, the processes are still running, and I don't want to use a script that kills the processes. Do you have any solution?

rkooo567 commented 1 year ago

@muratkoc93 is it possible for you to share a reproduction script? We will fix the issue, and having more repro scripts will help us fix it quickly. If there are IDLE processes not cleaned up after the job is terminated, it is a bug.

cadedaniel commented 1 year ago

I'm not able to reproduce this. Does anyone have a repro script? Otherwise I think we should close this.

I tried with VSCode 1.74.2 and Ray 2.1.0. I created the following Ray Tune script and entered debugging mode, then killed it with a KeyboardInterrupt (cmd+c). I saw the IDLE processes come into existence and then immediately go away. I tried both local mode and Ray running on my machine (ray start --head). I also tried killing the top-level process from the VSCode debugger UI.

#!/usr/bin/env python3

from ray import tune

def to_debug(*args):
    # trainable that never returns, giving time to kill the driver
    import time
    i = 0
    while True:
        print('iter', i)
        time.sleep(1)
        i += 1

tune.run(to_debug)
I monitored for leaked IDLE processes with:

while true; do echo 'checking..' ; ps aux | grep 'IDL[E]'; sleep 1; done

cadedaniel commented 1 year ago

It looks like we fixed a bug where workloads with Ray Datasets would leak IDLE processes due to a leaked reference to a stats actor. This is fixed in https://github.com/ray-project/ray/issues/22154. It could be that the observed leaking was due to this.

I will close for now; if someone has a repro script, I'm happy to fix it.

cc @muratkoc93 @rkooo567

mjlbach commented 1 year ago

I still see this issue on the latest ray release (ray-2.3.0) @cadedaniel. I'll try to get a reliable reproduction.

rkooo567 commented 1 year ago

Please reopen the issue once you find the repro!

claysmyth commented 1 year ago

Same issue on an Ubuntu machine! Any updates on a solution? @mjlbach

cadedaniel commented 1 year ago

@claysmyth do you have a repro? We want to fix this!

claysmyth commented 1 year ago

@cadedaniel My issue may actually have been a false alarm. My VSCode crashed while running a Jupyter notebook that used Ray remote tasks. I then force-logged out of my account (which usually kills running processes). I think what actually happened is that the idle Ray remote workers kept the Jupyter notebook server running, even after logging out. Once I found and killed the Jupyter notebook, the Ray remote workers were also killed.

However, I'll do some digging and report back if I find anything strange. Thanks!

dss010101 commented 1 year ago

I see a ton of these when working remotely with VSCode (in a devcontainer).

rkooo567 commented 1 year ago

@dss010101 can you provide a repro script?

dss010101 commented 1 year ago

> @dss010101 can you provide a repro script?

I will try to come up with one, but basically:

  1. Create a VSCode devcontainer (I built a container using the amazonlinux:2023 image and installed Python 3.11.3 on it).
  2. Run remotely from Windows to a Linux box -> your container/dev environment should be running on the Linux box.
  3. Build a simple Flask app and just do ray.init() on startup (a minimal sketch follows below).
  4. Run the Flask project.
  5. Kill the Python process (I do so by killing it from the VSCode terminal pane).
  6. If you run ps -ef | grep 'ray' on the Linux host where your container is running, you will probably see a number of defunct/dangling processes accumulate over time, depending on how many times you repeat steps 4 and 5.
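
A minimal sketch of steps 3 and 4 (not from the report; module name, route, and port are illustrative):

# app.py - hypothetical minimal Flask app that starts Ray at import time
import ray
from flask import Flask

ray.init()  # spawns Ray worker processes; killing Flask abruptly can orphan them

app = Flask(__name__)

@app.route("/")
def index():
    return "ok"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
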
josomir commented 1 month ago

Has there been any solution to this issue? I still see leaking ray::IDLEs, especially after version 2.11.0.

EDIT: after every run of Ray Workflows (~100 workflow tasks), we see additional IDLEs left behind, which take up more and more memory.
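
One way to quantify that growth between workflow runs (again a sketch assuming the third-party psutil package; not from the thread):

import psutil

# Sum the resident memory of all leftover ray::IDLE workers, in MiB.
total_rss = 0
for proc in psutil.process_iter(["name", "memory_info"]):
    try:
        if "ray::IDLE" in (proc.info["name"] or ""):
            total_rss += proc.info["memory_info"].rss
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

print(f"ray::IDLE total RSS: {total_rss / 2**20:.1f} MiB")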