
[Core] BUG: Cluster crashes when using temp_dir "could not connect to socket" raylet.x [since 2.7+] #44431

Open thoglu opened 5 months ago

thoglu commented 5 months ago

What happened + What you expected to happen

In newer versions (since ray 2.7, but not in 2.6.3) I get the following error when submitting a script to a local cluster that already has worker nodes connected to it AND is using a custom temp dir.

Error appears: start head node -> add workers -> submit job / start script on head node (error appears).

Error does not appear: 1) start head node -> submit job / start script -> then add workers (autoscale); no error anymore. However, once I stop a script running on the cluster and then submit it again, it will crash in the same way. 2) NOT using a temp dir -> no error anymore.

The error is of the following type:

raylet_client.cc:60: Could not connect to socket "logdir_path/sockets/raylet.3", followed by a bunch of stack traces. It could also be raylet.2 or any other raylet.x.

Again, it only happens with a temp_dir, and the error is not an issue in ray 2.6.3 or older, only in newer versions.

Versions / Dependencies

ray 2.7 or newer

Reproduction script

See the simple pipeline above, with any basic ray script, e.g.:

1) On the head node: ray start --head --temp_dir=/temp_dir/
2) On a worker node: connect a worker via "ray start" to the head node
3) Execute a script on the cluster, e.g. the following on the head node:

import ray

ray.init(address='auto', _temp_dir="/temp_dir/")

@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(4)]

print(ray.get(futures))

Issue Severity

This issue is pretty severe, since I either cannot use the temp dir anymore or have to restart the whole ray instance any time I want to start a new training job. (Again, the order matters: it is possible to use a temp dir by first starting the head node, then sending the script, and only then scaling the cluster, or by not using any temp dir at all, but neither is possible for me.)

sylv01 commented 5 months ago

I am having this same issue in versions 2.9.2 and 2.9.3. We've been using ray for years on fairly large HPCs and have not seen this error before. The error raylet_client.cc:60: Could not connect to socket ... only pops up with multi-node clusters; single-node ray clusters do not show this error.

Some notes on my setup...

Host 1: ray start --head --port=6379 --redis-password=redacted --num-cpus=96 --temp-dir=/path/on/nfs
Host 2: ray start --address=host1:6379 --redis-password=redacted --temp-dir=/path/on/nfs --num-cpus=96

Both hosts appear to start ray as expected with no errors.

Simply trying to connect to the cluster on host 1 will generate the error.

python -c "import ray; ray.init(address='auto', _redis_password='redacted') ; print('Ray nodes: ', len(ray.nodes()))"

Edit 1:

One important thing to note here is that the temp dir path is on a shared filesystem. We haven't had issues in the past using an NFS path for the temp dir. Could it be that host1 or host2 are trying to read/write socket files that were created by the other host?
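
As an illustration of that concern, here is a minimal sketch (not Ray code; the socket path is hypothetical, modeled on the error message above): a Unix domain socket file is an endpoint in the kernel of the host that created it, so even when the file is visible over NFS, connecting to it from another host fails.

import socket

# Hypothetical raylet socket path under the shared temp dir (assumption)
sock_path = "/path/on/nfs/sockets/raylet.3"

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    # On a host that did not create this socket, connect() typically fails
    # with ECONNREFUSED or ENOENT, much like "Could not connect to socket".
    client.connect(sock_path)
except OSError as err:
    print(f"connect failed: {err}")
finally:
    client.close()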

Edit 2:

Same error with ray 2.10.0 as well

LennartPurucker commented 4 months ago

I can reproduce this issue on my cluster. Downgrading to 2.6.3 works as well (thanks for mentioning that!).

keatincf commented 4 months ago

I've run into this issue as well with a setup where temp_dir is a directory on an NFS. If I use Singularity and mount a distinct directory for each node, rather than using a single directory shared across all nodes, I don't see the failure to connect to the raylet socket.

LennartPurucker commented 4 months ago

If I use Singularity and mount a distinct directory for each node, rather than using a single directory shared across all nodes

This also worked for me with individual ray tasks and was my go-to workaround before finding that downgrading helps, too.

However, for distributed ray tasks, this did not work because, if I understood the failure logs correctly, the individual ray workers all use the head worker's temp_dir.

keatincf commented 4 months ago

However, for distributed ray tasks, this did not work because, if I understood the failure logs correctly, the individual ray workers all use the head worker's temp_dir.

With Singularity, I was able to mount different directories to the path pointed to by temp_dir. So, each node still uses the path specified by the head's temp_dir, but that path points to a different directory on each node.

thoglu commented 4 months ago

But this is merely a workaround, right? Maybe some of the developers (@rkooo567) can chime in?

keatincf commented 4 months ago

Yes, it feels like it's just a workaround to me, unless there was an expectation that temp_dir isn't a directory shared across nodes. Either there's missing documentation about that expectation, or there's a bug in how temp_dir is used by the nodes.

jjyao commented 3 months ago

@thoglu do your head node and worker nodes run on the same machine, so that the temp_dir of the head node and the worker nodes points to the same physical directory?

thoglu commented 3 months ago

@jjyao no, the head node and worker nodes run on different machines with a shared filesystem. The temp_dir is only defined on the head node via ray start --temp_dir=...; doing the same on the worker nodes only reports that temp_dir is ignored on worker nodes and should be defined on the head node, so it is not passed there.

Again, in 2.6.3 it works, starting at 2.7.0 this setup fails with the above error.

rynewang commented 3 months ago

Hypothesis: we create unix sockets in the temp dir, but NFS does not support unix sockets, so the creation fails and then the subsequent read/connect fails.
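
A quick way to probe that hypothesis on a given machine is the sketch below (not Ray code; the directory path is a placeholder for your --temp-dir). Note that even if bind() succeeds, the resulting socket is only reachable from the host that created it, so a sockets directory shared across nodes would still be problematic.

import os
import socket

def supports_unix_socket(directory):
    # bind() creates a socket file in the directory; a filesystem that
    # rejects Unix domain sockets raises OSError here.
    path = os.path.join(directory, "uds_probe.sock")
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.bind(path)
        return True
    except OSError as err:
        print(f"bind failed: {err}")
        return False
    finally:
        s.close()
        if os.path.exists(path):
            os.remove(path)

print(supports_unix_socket("/path/on/nfs"))  # placeholder shared temp dir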

thoglu commented 3 months ago

@rynewang and that happens since 2.7.0? Remember, the issue is not present in 2.6.3 and before. If I switch off the custom temp dir and ray uses the default (in /tmp/..., I believe), it works. On top of the temp dir issue, in the new version I was not able to save logging artifacts (plots etc.) within each job subdirectory (using log_to_file=True); instead they were saved in the logging dir under "artifacts". In the documentation here: https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html , it seems I need to specify "storage_path" and put the shared filesystem path there?
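
For reference, a minimal sketch of what that guide describes (the storage path and experiment name below are placeholders, not taken from this issue): point storage_path at the shared filesystem via a RunConfig, which is then passed to a Trainer or Tuner.

from ray.train import RunConfig

# Point storage_path at the shared filesystem so results and artifacts
# are persisted there (path and name are placeholders).
run_config = RunConfig(
    storage_path="/shared_fs/ray_results",
    name="my_training_job",
)
# run_config would then be passed on, e.g.
#   TorchTrainer(..., run_config=run_config) or tune.Tuner(..., run_config=run_config)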