ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] Unset signal catching event after tune.run() finished. #37737

Open woshiyyya opened 1 year ago

woshiyyya commented 1 year ago

What happened + What you expected to happen

Currently, we set the signal-catching event globally in tune.run(). If the user sends a SIGTERM/SIGUSR1 signal to the driver process after the run has finished, it still triggers the warning message we set here.

We need to properly clear that handler once the Tune run has finished.
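
A minimal sketch of one way to do this, using only the standard library. This is not Ray's actual implementation: the helper name temporary_signal_handlers is hypothetical, and the sketch assumes the handlers are installed from the main thread on a POSIX system where SIGUSR1 exists. The idea is to remember the handlers that were active before the run and put them back in a finally block.

```python
import os
import signal
import threading
from contextlib import contextmanager


@contextmanager
def temporary_signal_handlers(signals, handler):
    """Install `handler` for `signals` and restore the old handlers on exit.

    Hypothetical helper for illustration; not part of Ray.
    """
    previous = {}
    try:
        for sig in signals:
            # signal.signal() returns the handler that was installed before.
            previous[sig] = signal.signal(sig, handler)
        yield
    finally:
        for sig, old_handler in previous.items():
            # getsignal()/signal() report None when the old handler was not
            # installed from Python; fall back to the default action then.
            if old_handler is None:
                old_handler = signal.SIG_DFL
            signal.signal(sig, old_handler)


if __name__ == "__main__":
    stop_event = threading.Event()

    def on_stop(signum, frame):
        # Stand-in for the "stop signal received" handling during a run.
        stop_event.set()

    original = signal.getsignal(signal.SIGUSR1)
    with temporary_signal_handlers([signal.SIGUSR1], on_stop):
        # While the "run" is active, SIGUSR1 is routed to on_stop.
        os.kill(os.getpid(), signal.SIGUSR1)
    # After the block, the handler that was active before is back in place.
    print(signal.getsignal(signal.SIGUSR1) == original)
```

With something along these lines wrapped around the run, a signal sent after the run finishes would reach whatever handler the user's process had before, instead of the Tune one.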

Versions / Dependencies

master

Reproduction script

import os
import signal

from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig


def train_loop():
    print("dummy loop")


trainer = TorchTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()

# The run has finished, but the handler installed by Tune is still active.
os.kill(os.getpid(), signal.SIGUSR1)

Output:

2023-07-24 15:20:16,660 WARNING tune.py:192 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip. 

Issue Severity

Low: It annoys or frustrates me.

woshiyyya commented 1 year ago

cc @matthewdeng