ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray hangs when machine is disconnected from network #7696

Open hartikainen opened 4 years ago

hartikainen commented 4 years ago

What is the problem?

When I disconnect my machine from the internet (e.g. by unplugging the ethernet cable) in the middle of a Tune training, the trials hang forever. This seems unexpected when running things locally. If it's by design, then this would be a feature request and not a bug report 🙂

Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.8.2

Reproduction (REQUIRED)

Run the script below and disconnect your machine from the network after the first result.

import time

from ray import tune


class MyTrainableClass(tune.Trainable):
    # _setup/_train are the Trainable hook names used by Ray 0.8.2.
    def _setup(self, config):
        self.timestep = 0

    def _train(self):
        self.timestep += 1
        result = {"episode_reward_mean": self.timestep}
        # Stop the trial after 100 steps.
        if self.timestep == 100:
            result["done"] = True
        # Slow each step down so there is time to unplug the network.
        time.sleep(5)
        return result


tune.run(
    MyTrainableClass,
    name="network-test",
    num_samples=1,
    config={"a": 1})
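
When reproducing this, it can help to record where the driver process is waiting once the trials stop reporting. The following is a minimal sketch using only the standard library's `faulthandler` module. The helper name `dump_stacks_periodically` and the idea of wrapping the `tune.run` call in it are assumptions for illustration, not part of the original report. It only covers the driver process; Ray workers run in separate processes and are not captured.

```python
import faulthandler
import tempfile
import time


def dump_stacks_periodically(work, interval_s, log_file):
    """Run work() and, while it is running, write the Python stack of every
    thread in this process to log_file every interval_s seconds.

    Hypothetical helper for illustration; not part of Ray.
    """
    # The dump is timer-based: it fires whether or not work() is stuck.
    # Comparing dumps from before and after the disconnect shows where
    # the process is waiting.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=log_file)
    try:
        return work()
    finally:
        # Always stop the timer, even if work() raises.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in workload; in a real reproduction this would be the
    # tune.run(...) call from the script above.
    with tempfile.TemporaryFile(mode="w+") as log:
        dump_stacks_periodically(lambda: time.sleep(0.5), 0.1, log)
        log.seek(0)
        print(log.read())
```

For the actual reproduction, an interval longer than one training step (each step sleeps 5 seconds) and a regular file on disk would probably be more practical than a temporary file.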

richardliaw commented 4 years ago

Oh wow ... I think I know what the issue is (we look for the IP address at each step).
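
For context, an IP lookup of the kind mentioned here is often done by connecting a UDP socket toward a public address and reading back the local address the OS picked. The sketch below shows that general technique with the standard library only. It is not taken from Ray's source; the function name `guess_local_ip`, the probe address, and the fallbacks are assumptions for illustration.

```python
import ipaddress
import socket


def guess_local_ip(probe_host="8.8.8.8", probe_port=80):
    """Best-effort guess of this machine's outbound IPv4 address.

    Hypothetical helper for illustration; not Ray's implementation.
    """
    try:
        # connect() on a UDP socket sends no packets. It only asks the OS
        # which local interface it would route through.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect((probe_host, probe_port))
            return sock.getsockname()[0]
    except OSError:
        # Typically raised when there is no route, e.g. cable unplugged.
        pass
    try:
        # Fall back to whatever the hostname resolves to.
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        # Last resort: loopback.
        return "127.0.0.1"


if __name__ == "__main__":
    ip = guess_local_ip()
    # Sanity check that the result parses as an IP address.
    print(ipaddress.ip_address(ip))
```

The point of the sketch is that the answer can differ before and after a disconnect, so code that repeats the lookup at each step may see a different address once the network goes away.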

richardliaw commented 4 years ago

Actually, that doesn't seem to be the case. This seems to be a Ray issue.

[Screenshot 2020-03-22 17 34 44]

After internet shutoff:

[Screenshot 2020-03-22 17 34 56]

stale[bot] commented 4 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

hartikainen commented 4 years ago

I believe this is still an issue and should not be closed.

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

wjaskowski commented 3 years ago

I have experienced the same issue on Ray 1.3.0.