ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[autoscaler] Check failed: _s.ok() Heartbeat failed: NotImplemented #8883

Open yutaizhou opened 4 years ago

yutaizhou commented 4 years ago

What is the problem?

When calling ray start on worker nodes in a SLURM system, not all worker nodes start properly. (In my experience, at most 7 nodes have started properly, while the most I am allowed to request is 16, as set by my account limit.)

Worker node hostname: d-12-5-2
2020-06-10 13:18:19,822 INFO scripts.py:429 -- Using IP address 172.31.130.57 for this node.
2020-06-10 13:18:19,827 INFO resource_spec.py:212 -- Starting Ray with 249.56 GiB memory available for workers and up to 106.96 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-10 13:18:19,846 INFO scripts.py:438 --
Started Ray on this node. If you wish to terminate the processes that have been started, run

    ray stop
2020-06-10 13:18:20,846 ERROR scripts.py:447 -- Ray processes died unexpectedly:
2020-06-10 13:18:20,848 ERROR scripts.py:450 --         raylet died with exit code -6
2020-06-10 13:18:20,848 ERROR scripts.py:452 -- Killing remaining processes and exiting...

Here is the output from a raylet ERROR/FATAL file (both kinds of files contain the same thing):

Log file created at: 2020/06/10 14:49:31
Running on machine: d-9-7-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0610 14:49:31.038458 60961 node_manager.cc:340]  Check failed: _s.ok() Heartbeat failed: NotImplemented: ...

Ray version and other system information (Python version, TensorFlow version, OS): Python: 3.6.10, Ray: 0.8.4

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

submit.sh

#!/bin/bash

#SBATCH --job-name=ray_trainer
#SBATCH -o out-ray_%j.log
#SBATCH -N 16
#SBATCH -n 80
#SBATCH --exclusive
#SBATCH --gres=gpu:volta:2

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

# purge any loaded modules
module purge > /dev/null 2>&1

((worker_num=$SLURM_NNODES-1)) # Must be one less than the total number of nodes
echo "Worker num: $worker_num"

# Set up environment
eval "$(conda shell.bash hook)"

# conda activate football2

export PYTHONNOUSERSITE=True
export PYTHONPATH=${PYTHONPATH}:~/documents/merlin/multiagent_football/src/envs

nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )

node1=${nodes_array[0]}

# Set IP address/port of head node
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
port='6379'
ip_head=$ip_prefix':'$port
export ip_head # Exporting for later access by trainer.py

echo "Head node: $node1 - $ip_head"

# Temporary directory for logging
tmpdir='/state/partition1/user/'$USER'/raytmp'
echo $tmpdir
mkdircmd='mkdir -p '$tmpdir

# Remove CUDA environment variables
unset CUDA_VISIBLE_DEVICES
# echo "GPUs: $CUDA_VISIBLE_DEVICES" # may not need to unset on current node

# Start the head
echo "Starting Ray on Head Node"
srun --nodes=1 --ntasks=1 -w $node1 $mkdircmd
srun --nodes=1 --ntasks=1 -w $node1 unset.sh && ray start --temp-dir=$tmpdir --block --head --redis-port=$port &
echo "Head node started"
sleep 30

# Start workers
echo "Adding workers"
for ((  i=1; i<=$worker_num; i++ ))
do
    node2=${nodes_array[$i]}
    echo "Worker node hostname: $node2"
    srun --nodes=1 --ntasks=1 -w $node2 mkdir -p $tmpdir
    srun --nodes=1 --ntasks=1 -w $node2 unset.sh && ray start --temp-dir=$tmpdir --block --address=$ip_head & # Starting the workers
    sleep 30
done
echo "Worker nodes started"

echo "slurm ntasks: $SLURM_NTASKS"
time python test.py $SLURM_NTASKS

test.py


# trainer.py
import os
import sys
import time
import torch 
import ray

ray.init(address=os.environ["ip_head"])

@ray.remote
def f(index):
    print(f"{index}: {torch.cuda.is_available()}")
    time.sleep(1)

# The following takes one second (assuming that ray was able to access all of the allocated nodes).
start = time.time()
num_cpus = int(sys.argv[1])
ray.get([f.options(num_gpus=0.1).remote(i) for i in range(num_cpus)])
end = time.time()
print(end - start)

print("ray cluster resources")
print(ray.cluster_resources())

print("ray nodes")
for node in ray.nodes():
    print(node)
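
For reference, the job is submitted in the usual SLURM way (nothing Ray-specific in the submission itself), and the ray start output quoted above ends up in the file named by the #SBATCH -o directive:

sbatch submit.sh                # prints "Submitted batch job <jobid>"
tail -f out-ray_<jobid>.log     # %j in the -o pattern expands to the job ID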

If we cannot run your script, we cannot fix your issue.

stephanie-wang commented 4 years ago

@micafan it looks like the check is failing here and you were the last to edit this file. Can you take a look?

yutaizhou commented 4 years ago

I'd like to emphasize that not all worker nodes fail. The most I have gotten spun up is 7 (out of 16 total nodes).

mehrdadn commented 4 years ago

From what I understand, this is due to the GCS server and Redis dying at different times. (Their lifetimes were expected to be the same.) I'm not too familiar with the problem, though. CC @wumuzi520 perhaps?

micafan commented 4 years ago

@micafan it looks like the check is failing here and you were the last to edit this file. Can you take a look?

I didn't add this line, but I think this line is reasonable. You shouldn't use RedisAsyncContext after you have called the method ResetRawRedisAsyncContext. RedisAsyncContext is only a wrapper around redisAsyncContext, and ResetRawRedisAsyncContext resets the member pointer redis_async_context_ to nullptr.

richardliaw commented 4 years ago

@micafan is there a bug in the code?

Also, what should the user do in this case?

micafan commented 4 years ago

@micafan is there a bug in the code?

Also, what should the user do in this case?

I am not familiar with this test case, but it seems GCSClient has already disconnected from Redis, so the crash here isn't unexpected.
GCSClient will support reconnecting to Redis in the future; that will solve this problem.

yutaizhou commented 4 years ago

Is there anything a user can do aside from waiting for that support? I'd like to run experiments for an HPC conference paper with 64 nodes (my current limit is 16, but I plan to ask our HPC team for an extension). But if Ray can only spin up 7 of the 16 nodes I have, it's hard to be optimistic and think "well, that's almost half of the allocated nodes, so with 64 as the limit, I should be able to get around 30".

richardliaw commented 4 years ago

@yutaizhou do you have any Redis logs? (All of the logs in /tmp/ray/session_latest/logs would be helpful here.)

stephanie-wang commented 4 years ago

@micafan is there a bug in the code? Also, what should the user do in this case?

I am not familiar with this test case, but it seems GCSClient has already disconnected from Redis, so the crash here isn't unexpected. GCSClient will support reconnecting to Redis in the future; that will solve this problem.

@micafan, I don't think that is the full story here. The fact that some nodes are able to connect means that Redis/GCS is still alive, right? So why are only some of the nodes failing with that error?

yutaizhou commented 4 years ago

@yutaizhou do you have any Redis logs? (All of the logs in /tmp/ray/session_latest/logs would be helpful here.)

@richardliaw I actually get nothing. The $tmpdir that I pass here is empty, both on the nodes that spun up and on the ones that didn't.

srun --nodes=1 --ntasks=1 -w $node1 unset.sh && ray start --temp-dir=$tmpdir --block --head --redis-port=$port &

/tmp/ray/session_latest/logs does not get created, presumably because I already pass in $tmpdir.
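
For what it's worth, this is roughly how the per-node temp dirs can be checked (a sketch only, reusing $tmpdir and the node list from submit.sh, run from inside the same allocation):

# check the Ray temp dir on every allocated node
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    echo "== $node =="
    srun --nodes=1 --ntasks=1 -w $node ls -la $tmpdir
done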

mehrdadn commented 4 years ago

@yutaizhou This isn't officially documented (so it might change/stop working in the future), but at the moment it should work: could you try running Ray after setting the environment variable GLOG_logtostderr=1? It should print everything to the terminal.
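
In the context of the submit script above, that would look roughly like the following (a sketch, not tested; srun propagates exported environment variables to the remote tasks by default, so exporting before the srun lines should be enough):

# per the suggestion above: have the glog-based Ray processes log to stderr,
# so their output ends up in the SLURM job log instead of files under --temp-dir
export GLOG_logtostderr=1
srun --nodes=1 --ntasks=1 -w $node1 unset.sh && ray start --temp-dir=$tmpdir --block --head --redis-port=$port &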