What happened + What you expected to happen
Currently, tune.run() installs its signal handler globally and never removes it. If the user sends a SIGTERM/SIGUSR1 signal to the driver process after the run has finished, the handler still fires and logs the stop-signal message set in tune.run().
We need to clear that handler properly once the Tune run has finished.
Versions / Dependencies
master
Reproduction script
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

def train_loop():
    print("dummy loop")

trainer = TorchTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()

# fit() has returned, but the handler installed by Tune is still active.
import os, signal
os.kill(os.getpid(), signal.SIGUSR1)
Output:
2023-07-24 15:20:16,660 WARNING tune.py:192 -- Stop signal received (e.g. via SIGINT/Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
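One possible shape for the fix, sketched with the standard library only: install the handler for the duration of the run and put back whatever was registered before. This is not Ray's actual implementation; the name `temporary_signal_handlers` and its arguments are hypothetical and only illustrate the save-and-restore pattern.

```python
import signal
from contextlib import contextmanager


@contextmanager
def temporary_signal_handlers(handler, signums):
    """Install `handler` for each signal in `signums`; restore the old handlers on exit.

    Hypothetical helper, not part of Ray. Must be used from the main thread,
    because signal.signal() can only be called there.
    """
    previous = {}
    try:
        for signum in signums:
            # signal.signal() returns the handler that was registered before.
            previous[signum] = signal.signal(signum, handler)
        yield
    finally:
        for signum, old_handler in previous.items():
            # A previous handler of None means it was not installed from Python;
            # falling back to the default action is an approximation.
            if old_handler is None:
                old_handler = signal.SIG_DFL
            signal.signal(signum, old_handler)


if __name__ == "__main__":
    def on_stop_signal(signum, frame):
        print("Stop signal received, ending run.")

    with temporary_signal_handlers(on_stop_signal, [signal.SIGUSR1]):
        # The run would execute here; the custom handler is active.
        print(signal.getsignal(signal.SIGUSR1) is on_stop_signal)

    # After the block, the original handler is registered again.
    print(signal.getsignal(signal.SIGUSR1) is on_stop_signal)
```

With a pattern like this, a signal sent after the run returns would reach whatever handler the user had before the run instead of the one installed for the run.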
Issue Severity
Low: It annoys or frustrates me.