ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Process signalling using `SIGINT` not working in `ray.remote` processes #31805

Open ezorita opened 1 year ago

ezorita commented 1 year ago

What happened + What you expected to happen

It seems Ray processes are not able to signal subprocesses properly. When a task or actor creates a subprocess, it cannot communicate with it using signal.SIGINT. The script below reproduces the issue: it spins up a subprocess running sleep 100 and then signals it to finish. The subprocess should terminate on either SIGINT or SIGKILL, but under Ray tasks/actors it only responds to SIGKILL.

I would expect process signalling to work normally.

Versions / Dependencies

Ray 2.2.0, Python 3.8.10

Reproduction script

import subprocess
import signal
import time
import ray

def process_signal(s):
    print("running and awaiting 'sleep 5'")
    p = subprocess.Popen(["sleep", "5"])
    retval = p.wait()
    print(f"done (retval {retval})")

    print(f"running and signal {s} to 'sleep 100' after 2 seconds")
    p = subprocess.Popen(["sleep", "100"])
    time.sleep(2)
    p.send_signal(s)
    retval = p.wait()
    print(f"done (retval {retval})")

@ray.remote
def process_signal_task(s):
    process_signal(s)

@ray.remote
class SignalTestActor:
    def __init__(self, s):
        process_signal(s)

if __name__ == "__main__":
    ray.init()
    print("running signal tests locally")
    process_signal(signal.SIGINT)
    process_signal(signal.SIGKILL)

    print("running signal tests on ray (SIGKILL)")
    SignalTestActor.remote(signal.SIGKILL)
    ray.get(process_signal_task.remote(signal.SIGKILL))

    print("running signal tests on ray (SIGINT)")
    SignalTestActor.remote(signal.SIGINT)
    ray.get(process_signal_task.remote(signal.SIGINT))

Issue Severity

High: It blocks me from completing my task.

ezorita commented 1 year ago

This issue is clearly caused by an inherited signal mask that blocks SIGINT. I have reviewed the code and found two places in which a SIGINT blocking mask is applied to the process:

I wonder whether the signal blocking is strictly necessary, since we can't assume that all the code (user + libraries) running in the child processes will unblock the signals before forking further. That code might rely on these signals to work properly.
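For readers following along, here is a minimal sketch of the inherited-mask behavior described above. It does not use Ray. It assumes a POSIX system with a `sleep` binary and simulates the suspected worker state by blocking SIGINT in the parent. The helper names (`unblock_sigint`, `sigint_child`, `demo`) are hypothetical and only for illustration. The possible workaround shown is to clear SIGINT from the mask in the child before `exec`, using the standard `preexec_fn` hook of `subprocess.Popen`.

```python
import signal
import subprocess


def unblock_sigint():
    # Runs in the child between fork() and exec().
    # Reset the disposition in case SIGINT was ignored in the parent,
    # then remove SIGINT from the inherited signal mask.
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT})


def sigint_child(unblock, timeout=1.0):
    """Start 'sleep 100', send it SIGINT, and return its exit code.

    Returns None if the child ignored the signal and had to be killed.
    """
    p = subprocess.Popen(
        ["sleep", "100"],
        preexec_fn=unblock_sigint if unblock else None,
    )
    p.send_signal(signal.SIGINT)
    try:
        return p.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        p.kill()  # SIGKILL cannot be blocked, so this always works
        p.wait()
        return None


def demo():
    # Assumption: this mimics a worker process whose SIGINT is blocked.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        inherited = sigint_child(unblock=False)  # child inherits the mask
        unblocked = sigint_child(unblock=True)   # child clears the mask
    finally:
        # Restore the parent's original mask.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return inherited, unblocked


if __name__ == "__main__":
    print(demo())
```

With the mask inherited, the child stays alive until it is killed. With the mask cleared, it exits on SIGINT. Note that the Python documentation warns that `preexec_fn` is not safe in the presence of threads, so in a multi-threaded worker this may not be a reliable fix. Whether it is suitable inside a Ray task or actor is untested here.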

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.