ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray on a cluster: ConnectionError: Could not find any running Ray instance #10466

Closed Lewisracing closed 3 years ago

Lewisracing commented 4 years ago

I'm trying to test Ray on a university cluster with the code below:

import ray
ray.init(address="auto")
import time

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

set(ray.get([f.remote() for _ in range(1000)]))

But it returns an error like this. Did I use Ray the wrong way, or what?

File "", line 2, in File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/worker.py", line 643, in init address, redis_address) File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 273, in validate_redis_address address = find_redis_address_or_die() File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 165, in find_redis_address_or_die "Could not find any running Ray instance. " ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.

rkooo567 commented 4 years ago

You should start a Ray instance using ray start --head or ray start --address="[your head node address]" on your nodes.

Lewisracing commented 4 years ago

You should start a Ray instance using ray start --head or ray start --address="[your head node address]" on your nodes.

This solves it. But Ray doesn't seem to recognise all the nodes. I have two nodes for the test script, but based on the info below it seems to be working on only one of them.

2020-09-01 10:05:52,333 INFO scripts.py:395 -- Using IP address 10.0.1.101 for this node.
2020-09-01 10:05:52,336 INFO resource_spec.py:212 -- Starting Ray with 72.02 GiB memory available for workers and up to 34.86 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-01 10:05:52,592 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-09-01 10:05:53,526 INFO services.py:1165 -- View the Ray dashboard at localhost:8265
2020-09-01 10:05:53,613 INFO scripts.py:425 -- Started Ray on this node. You can add additional nodes to the cluster by calling

ray start --address='10.0.1.101:6379' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

import ray
ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

ray stop

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0901 10:05:54.369500 609 609 global_state_accessor.cc:25] Redis server address = 10.0.1.101:6379, is test flag = 0
I0901 10:05:54.370640 609 609 redis_client.cc:141] RedisClient connected.
I0901 10:05:54.378907 609 609 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0901 10:05:54.379729 609 609 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.

Also, when I try my own code on 6 nodes (96 CPUs), it still launches all the workers on a single node.

rkooo567 commented 4 years ago

Can you elaborate on the steps you took to connect the two nodes?

Lewisracing commented 4 years ago

I'm not really familiar with the operation of the cluster system. I basically request a number of nodes in my job submission script: #PBS -l select=6:ncpus=16:mpiprocs=16. I'm not sure if it helps, but the cluster manual says: compute nodes, GPU nodes and storage are connected with an InfiniBand EDR low-latency interconnect.

rkooo567 commented 4 years ago

I see. You should set up a Ray instance per node by calling ray start. On the head node, run ray start --head, and on worker nodes, run ray start --address=[address of Redis on the head node]. That said, just follow the instructions in the output of ray start in your comment.
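
In other words, something like this (a sketch; the head-node IP and password below are just the values from your log, used as placeholders):

# On the head node:
ray start --head --port=6379

# On each worker node, pointing at the head node's address:
ray start --address='10.0.1.101:6379' --redis-password='5241590000000000'

# In the driver script, connect to the existing cluster:
# import ray
# ray.init(address='auto', redis_password='5241590000000000')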

Patol75 commented 4 years ago

I have managed to have Ray run on a PBS cluster using the following script

#!/bin/bash
#PBS -l ncpus=192
#PBS -l mem=600GB
#PBS -l walltime=48:00:00
#PBS -l wd

module load python3/3.7.4

ip_prefix=`hostname -i`
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)

echo parameters: $ip_head $redis_password

/path/to/ray start --head --port=6379 \
--redis-password=$redis_password \
--num-cpus 48 --num-gpus 0
sleep 10

for (( n=48; n<$PBS_NCPUS; n+=48 ))
do
  pbsdsh -n $n -v /path/to/startWorkerNode.sh \
  $ip_head $redis_password &
  sleep 10
done

cd /path/to/working/directory || exit
./Script.py --pw $redis_password

/path/to/ray stop

with startWorkerNode.sh being

#!/bin/bash -l

module load python3/3.7.4

/path/to/ray start --block --address=$1 \
--redis-password=$2 --num-cpus 48 --num-gpus 0

/path/to/ray stop

Within Script.py, I have

ray.init(address='auto', redis_password=args.pw)

where the Redis password is retrieved through argparse.

Hope that helps. :)
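
For reference, a minimal sketch of the top of Script.py (only the --pw flag comes from the call above; the rest is illustrative):

#!/usr/bin/env python3
import argparse

import ray

# Retrieve the Redis password passed on the command line (./Script.py --pw <password>).
parser = argparse.ArgumentParser()
parser.add_argument('--pw', help='Redis password generated in the job script')
args = parser.parse_args()

# Connect to the Ray cluster started by the job script instead of starting a new one.
ray.init(address='auto', redis_password=args.pw)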

Lewisracing commented 3 years ago

@Patol75 Hi Thomas, I'm trying to replicate your code. However, my workers still run on a single node. The following is my code. Could you help me figure out what difference could lead to this issue? Many thanks.

#!/bin/bash
#PBS -N V12_3P_L1_RAY
#PBS -l select=2:ncpus=10:mpiprocs=10
#PBS -q half_hour
#PBS -l application=Python
#PBS -j oe
#PBS -W sandbox=PRIVATE
#PBS -k n
ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID
## Change to working directory
cd $PBS_O_WORKDIR
## Calculate number of CPUs
export cpus=`cat $PBS_NODEFILE | wc -l`

ip_prefix=`hostname -i`
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password='5241590000000000'
echo parameters: $ip_head

module load PyTorch
ray start --head --port=6379 --redis-password=$redis_password
sleep 5

for (( n=4; n<=20; n+=4 ))
do
  pbsdsh -n $n -v /mnt/gpfs0/home/s267565/test_distributed_submission/V12_3P_L1_RAY/startWorkerNode.sh  $ip_head $PBS_O_WORKDIR 

done
sleep 5

module load PyTorch
python < Main.py

rm $PBS_O_WORKDIR/$PBS_JOBID

And for the startWorkerNode.sh, I have

#!/bin/bash

module load PyTorch

ray start --address=$1 --redis-password='5241590000000000'

Patol75 commented 3 years ago

Hey @Lewisracing,

In my case, having #!/bin/bash -l in startWorkerNode.sh was necessary; I think it had something to do with environment variables not being loaded otherwise. Additionally, I think you need --block in the ray start command in that same file; otherwise the pbsdsh process will simply finish and the worker will not stay active. To go with that change, you also need to add an & at the end of the pbsdsh command within the for loop; otherwise, the program would hang forever, waiting for the call to complete. This should, hopefully, fix your issue. Another possibility is to install the nightly version of Ray; I remember that I did not have any problem with Ray 0.8.7, but I could not manage to make Ray 1.0 work. Let me know how you go. :)
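
To make that concrete, a minimal sketch of your loop and worker script with those two changes applied (everything else as in your files above):

# In the job script: background each pbsdsh call so the loop does not hang.
for (( n=4; n<=20; n+=4 ))
do
  pbsdsh -n $n -v /mnt/gpfs0/home/s267565/test_distributed_submission/V12_3P_L1_RAY/startWorkerNode.sh $ip_head &
done

# In startWorkerNode.sh: --block keeps the worker's ray process (and the pbsdsh task) alive.
ray start --block --address=$1 --redis-password='5241590000000000'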

Lewisracing commented 3 years ago

Thanks @Patol75

Unfortunately all workers are still working on one node...

ip_prefix=`hostname -i`
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password='5241590000000000'
echo parameters: $ip_head

module load PyTorch
ray start --head --port=6379 --redis-password=$redis_password
sleep 5

for (( n=4; n<=20; n+=4 ))
do
  pbsdsh -n $n -v /mnt/gpfs0/home/s267565/test_distributed_submission/V12_3P_L1_RAY/startWorkerNode.sh  $ip_head & 

done

startWorkerNode.sh:

#!/bin/bash -l

module load PyTorch

ray start --block --address=$1 --redis-password='5241590000000000'

Btw, what output do you have from PBS in your successful case? Not sure if this helps (screenshot attached).

Patol75 commented 3 years ago

Have a look below. Which version of Ray are you using? Can you also include how you initialise Ray in your Python script? Can you also verify in the log that the address printed by echo matches the one reported above (i.e. 10.0.1.106:6379)?

job.sh.eXXX

Loading python3/3.8.5
  Loading requirement: intel-mkl/2020.2.254
2020-10-29 14:45:01,634 INFO services.py:1088 -- View the Ray dashboard at http://localhost:8265
Loading python3/3.8.5
  Loading requirement: intel-mkl/2020.2.254
Loading python3/3.8.5
  Loading requirement: intel-mkl/2020.2.254
Loading python3/3.8.5
  Loading requirement: intel-mkl/2020.2.254
2020-10-29 14:45:20,941 INFO worker.py:673 -- Connecting to existing Ray cluster at address: 10.6.47.57:6379

job.sh.oXXX

2020-10-29 14:45:01,108 INFO scripts.py:467 -- Local node IP: 10.6.47.57
2020-10-29 14:45:01,662 SUCC scripts.py:497 -- --------------------
2020-10-29 14:45:01,662 SUCC scripts.py:498 -- Ray runtime started.
2020-10-29 14:45:01,662 SUCC scripts.py:499 -- --------------------
2020-10-29 14:45:01,662 INFO scripts.py:501 -- Next steps
2020-10-29 14:45:01,662 INFO scripts.py:502 -- To connect to this Ray runtime from another node, run
2020-10-29 14:45:01,663 INFO scripts.py:504 --   ray start --address='10.6.47.57:6379' --redis-password='5241590000000000'
2020-10-29 14:45:01,663 INFO scripts.py:509 -- Alternatively, use the following Python code:
2020-10-29 14:45:01,663 INFO scripts.py:512 -- import ray
2020-10-29 14:45:01,664 INFO scripts.py:513 -- ray.init(address='auto', _redis_password='5241590000000000')
2020-10-29 14:45:01,664 INFO scripts.py:521 -- If connection fails, check your firewall settings and network configuration.
2020-10-29 14:45:01,664 INFO scripts.py:526 -- To terminate the Ray runtime, run
2020-10-29 14:45:01,664 INFO scripts.py:527 --   ray stop
0:26:24.734784530941397
2020-10-29 15:11:45,721 VINFO scripts.py:739 -- Send termination request to `/home/157/td5646/.local/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/raylet --store_socket_name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/plasma_store --object_manager_port=0 --min_worker_port=10000 --max_worker_port=10999 --node_manager_port=59009 --node_ip_address=10.6.47.57 --redis_address=10.6.47.57 --redis_port=6379 --num_initial_workers=96 --maximum_startup_concurrency=96 --static_resource_list=node:10.6.47.57,1.0,CPU,96,memory,2414,object_store_memory,770 --config_list=plasma_store_as_thread,True "--python_worker_command=/apps/python3/3.8.5/bin/python3 /home/157/td5646/.local/lib/python3.8/site-packages/ray/workers/default_worker.py --node-ip-address=10.6.47.57 --node-manager-port=59009 --object-store-name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/plasma_store --raylet-name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/raylet --redis-address=10.6.47.57:6379 --config-list=plasma_store_as_thread,True --temp-dir=/tmp/ray --metrics-agent-port=56606 --redis-password=5241590000000000" --java_worker_command= --cpp_worker_command= --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2020-10-29_14-45-01_113530_653785 --metrics-agent-port=56606 --metrics_export_port=58740 "--agent_command=/apps/python3/3.8.5/bin/python3 -u /home/157/td5646/.local/lib/python3.8/site-packages/ray/new_dashboard/agent.py --redis-address=10.6.47.57:6379 --metrics-export-port=58740 --dashboard-agent-port=56606 --node-manager-port=59009 --object-store-name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/plasma_store --raylet-name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/raylet --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/logs --redis-password=5241590000000000" --object_store_memory=58535487897 --plasma_directory=/dev/shm --head_node` (via SIGTERM)
2020-10-29 15:11:45,721 VINFO scripts.py:739 -- Send termination request to `/home/157/td5646/.local/lib/python3.8/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/plasma_store -m 58535487897 -d /dev/shm -z` (via SIGTERM)
2020-10-29 15:11:45,722 VINFO scripts.py:739 -- Send termination request to `/home/157/td5646/.local/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=10.6.47.57 --redis_port=6379 --config_list=plasma_store_as_thread,True --gcs_server_port=0 --metrics-agent-port=56606 --node-ip-address=10.6.47.57 --redis_password=5241590000000000` (via SIGTERM)
2020-10-29 15:11:45,723 VINFO scripts.py:739 -- Send termination request to `/apps/python3/3.8.5/bin/python3 -u /home/157/td5646/.local/lib/python3.8/site-packages/ray/monitor.py --redis-address=10.6.47.57:6379 --redis-password=5241590000000000` (via SIGTERM)
2020-10-29 15:11:45,723 VINFO scripts.py:739 -- Send termination request to `/apps/python3/3.8.5/bin/python3 -u /home/157/td5646/.local/lib/python3.8/site-packages/ray/log_monitor.py --redis-address=10.6.47.57:6379 --logs-dir=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/logs --redis-password 5241590000000000` (via SIGTERM)
2020-10-29 15:11:45,724 VINFO scripts.py:739 -- Send termination request to `"/home/157/td5646/.local/lib/python3.8/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""` (via SIGTERM)
2020-10-29 15:11:45,725 VINFO scripts.py:739 -- Send termination request to `"/home/157/td5646/.local/lib/python3.8/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:14537" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""` (via SIGTERM)
2020-10-29 15:11:45,726 VINFO scripts.py:739 -- Send termination request to `/home/157/td5646/.local/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/raylet --store_socket_name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/plasma_store --object_manager_port=0 --min_worker_port=10000 --max_worker_port=10999 --node_manager_port=59009 --node_ip_address=10.6.47.57 --redis_address=10.6.47.57 --redis_port=6379 --num_initial_workers=96 --maximum_startup_concurrency=96 --static_resource_list=node:10.6.47.57,1.0,CPU,96,memory,2414,object_store_memory,770 --config_list=plasma_store_as_thread,True "--python_worker_command=/apps/python3/3.8.5/bin/python3 /home/157/td5646/.local/lib/python3.8/site-packages/ray/workers/default_worker.py --node-ip-address=10.6.47.57 --node-manager-port=59009 --object-store-name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/plasma_store --raylet-name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/raylet --redis-address=10.6.47.57:6379 --config-list=plasma_store_as_thread,True --temp-dir=/tmp/ray --metrics-agent-port=56606 --redis-password=5241590000000000" --java_worker_command= --cpp_worker_command= --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2020-10-29_14-45-01_113530_653785 --metrics-agent-port=56606 --metrics_export_port=58740 "--agent_command=/apps/python3/3.8.5/bin/python3 -u /home/157/td5646/.local/lib/python3.8/site-packages/ray/new_dashboard/agent.py --redis-address=10.6.47.57:6379 --metrics-export-port=58740 --dashboard-agent-port=56606 --node-manager-port=59009 --object-store-name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/plasma_store --raylet-name=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/sockets/raylet --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/logs --redis-password=5241590000000000" --object_store_memory=58535487897 --plasma_directory=/dev/shm --head_node` (via SIGTERM)
2020-10-29 15:11:45,728 VINFO scripts.py:739 -- Send termination request to `/apps/python3/3.8.5/bin/python3 -u /home/157/td5646/.local/lib/python3.8/site-packages/ray/log_monitor.py --redis-address=10.6.47.57:6379 --logs-dir=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/logs --redis-password 5241590000000000` (via SIGTERM)
2020-10-29 15:11:45,730 VINFO scripts.py:739 -- Send termination request to `/apps/python3/3.8.5/bin/python3 -u /home/157/td5646/.local/lib/python3.8/site-packages/ray/new_dashboard/dashboard.py --host=localhost --port=8265 --redis-address=10.6.47.57:6379 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2020-10-29_14-45-01_113530_653785/logs --redis-password 5241590000000000` (via SIGTERM)
2020-10-29 15:11:45,730 SUCC scripts.py:758 -- Stopped all 10 Ray processes.

======================================================================================
                  Resource Usage on 2020-10-29 15:11:47:
   Job Id:             12992552.gadi-pbs
   Project:            xd2
   Exit Status:        0
   Service Units:      171.52
   NCPUs Requested:    192                    NCPUs Used: 192             
                                           CPU Time Used: 81:31:19                                   
   Memory Requested:   150.0GB               Memory Used: 71.63GB         
   Walltime requested: 01:00:00            Walltime Used: 00:26:48        
   JobFS requested:    400.0MB                JobFS used: 0B              
======================================================================================

Lewisracing commented 3 years ago

The address is verified. Ray version is 0.8.6. The Python script is as below:

from Lewis_functions import *
import torch.multiprocessing

torch.multiprocessing.set_sharing_strategy('file_system')
import ray

ray.init(address="auto")
#ray.init()
if __name__ == '__main__':
    cfg = PPO_config()

    i=0
    while i<20 :
        print(i)
        i+=1

    learner = [Learner.remote(cfg)]
    workers = [Worker.remote(i, cfg) for i in range(cfg.number_of_workers)]
    all_actors = learner + workers
    ray.wait([actor.start_working.remote() for actor in all_actors])

When I manually request 2 nodes and then manually start Ray on them, it actually works: the workers are properly distributed. But when I try to use the shell script to do the job, it doesn't work anymore. I suspect there's something wrong with the shell script... but I'm no expert.

Patol75 commented 3 years ago

Can you try to remove everything related to the Redis password in the shell scripts and try again?

Lewisracing commented 3 years ago

Yup... But still kinda same...

parameters: 10.0.1.8:6379
2020-11-21 17:45:15,123 INFO scripts.py:395 -- Using IP address 10.0.1.8 for this node.
2020-11-21 17:45:15,128 INFO resource_spec.py:212 -- Starting Ray with 75.1 GiB memory available for workers and up to 36.18 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-11-21 17:45:15,419 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-11-21 17:45:16,666 INFO services.py:1165 -- View the Ray dashboard at localhost:8265
2020-11-21 17:45:16,802 INFO scripts.py:425 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --address='10.0.1.8:6379' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop
pbsdsh: spawned task 0x00000000 on logical node 0 event 2
pbsdsh: waiting on 1 spawned and 0 obits
pbsdsh: waiting on 0 spawned and 1 obits
pbsdsh: task 0x00000000 exit status 254
pbsdsh: spawned task 0x00000000 on logical node 12 event 2
pbsdsh: waiting on 1 spawned and 0 obits
pbsdsh: waiting on 0 spawned and 1 obits
pbsdsh: task 0x00000000 exit status 254
pbsdsh: spawned task 0x00000000 on logical node 4 event 2
pbsdsh: waiting on 1 spawned and 0 obits
pbsdsh: waiting on 0 spawned and 1 obits
pbsdsh: task 0x00000000 exit status 254
pbsdsh: spawned task 0x00000000 on logical node 16 event 2
pbsdsh: waiting on 1 spawned and 0 obits
pbsdsh: waiting on 0 spawned and 1 obits
pbsdsh: task 0x00000000 exit status 254
pbsdsh: spawned task 0x00000000 on logical node 8 event 2
pbsdsh: waiting on 1 spawned and 0 obits
pbsdsh: waiting on 0 spawned and 1 obits
pbsdsh: task 0x00000000 exit status 254
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1121 17:45:38.792421 21817 21817 global_state_accessor.cc:25] Redis server address = 10.0.1.8:6379, is test flag = 0
I1121 17:45:38.793691 21817 21817 redis_client.cc:141] RedisClient connected.
I1121 17:45:38.793738 21817 21817 redis_gcs_client.cc:88] RedisGcsClient Connected.
I1121 17:45:38.794679 21817 21817 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.

Patol75 commented 3 years ago

How many processors are you requesting for the job? And how many processors per node are there in the queue you have chosen? In other words, can you explain the meaning of this line: #PBS -l select=2:ncpus=10:mpiprocs=10?

Lewisracing commented 3 years ago

Each node in the cluster has 16 processors. What I'm requesting is 2 groups of 10 processors, which means the PBS scheduler will give me a total of 20 processors from 2 nodes (10 from each).
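
Annotated for clarity (the trailing comment is an illustrative reading, not part of the original script):

#PBS -l select=2:ncpus=10:mpiprocs=10   # 2 chunks x 10 CPUs = 20 CPUs spread over 2 nodes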

Patol75 commented 3 years ago

So can you try for (( n=10; n<20; n+=10 ))?

The idea is that you only need one pbsdsh call per extra node. So actually, in that case, you do not even need the loop.
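
For instance, something like this single call (a sketch; the path is the one from your script, and -n 10 targets the first CPU slot of the second node given 10 ranks per node):

pbsdsh -n 10 -v /mnt/gpfs0/home/s267565/test_distributed_submission/V12_3P_L1_RAY/startWorkerNode.sh $ip_head &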

Lewisracing commented 3 years ago

Yeah that makes sense. Unfortunately it still doesn't work... Btw, based on my manual attempts, it doesn't hurt if I repeatedly initialize a worker node, does it?

Patol75 commented 3 years ago

Can you confirm you tried the command from my latest edit? If yes, can you post the log and error content?

Lewisracing commented 3 years ago

Yes, I did use the for (( n=10; n<20; n+=10 )) line. Below is the log. Actually, no obvious error was reported, but when I SSH into the nodes and check their status, all the workers are still on the head node, while the other node doesn't seem to have anything running on it.

parameters: 10.0.1.96:6379
2020-11-22 09:02:09,708 INFO scripts.py:395 -- Using IP address 10.0.1.96 for this node.
2020-11-22 09:02:09,712 INFO resource_spec.py:212 -- Starting Ray with 72.27 GiB memory available for workers and up to 34.97 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-11-22 09:02:09,993 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-11-22 09:02:10,903 INFO services.py:1165 -- View the Ray dashboard at localhost:8265
2020-11-22 09:02:10,978 INFO scripts.py:425 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --address='10.0.1.96:6379' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop
pbsdsh: spawned task 0x00000000 on logical node 10 event 2
pbsdsh: waiting on 1 spawned and 0 obits
pbsdsh: waiting on 0 spawned and 1 obits
pbsdsh: task 0x00000000 exit status 254
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1122 09:02:32.271049 18813 18813 global_state_accessor.cc:25] Redis server address = 10.0.1.96:6379, is test flag = 0
I1122 09:02:32.272019 18813 18813 redis_client.cc:141] RedisClient connected.
I1122 09:02:32.272071 18813 18813 redis_gcs_client.cc:88] RedisGcsClient Connected.
I1122 09:02:32.273057 18813 18813 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

The rest is simply irrelevant program output.

Patol75 commented 3 years ago

Hum... OK, nothing obvious comes to mind now. Would you mind posting the updated versions of both shell scripts?

Lewisracing commented 3 years ago

Sure. Main script:

#!/bin/bash
#PBS -N Test_RAY
#PBS -l select=2:ncpus=10:mpiprocs=10
#PBS -q half_hour
#PBS -l application=Python
#PBS -j oe
#PBS -W sandbox=PRIVATE
#PBS -k n
ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID
cd $PBS_O_WORKDIR
export cpus=`cat $PBS_NODEFILE | wc -l`

ip_prefix=`hostname -i`
suffix=':6379'
ip_head=$ip_prefix$suffix
echo parameters: $ip_head

module load PyTorch
ray start --head --port=6379 
sleep 5

for (( n=10; n<20; n+=10 ))
do
  pbsdsh -n $n -v /mnt/gpfs0/home/s267565/test_distributed_submission/V12_3P_L1_RAY/startWorkerNode.sh  $ip_head & 

done

sleep 5

module load PyTorch
python < Main.py

rm $PBS_O_WORKDIR/$PBS_JOBID
#

startWorkerNode.sh:

#!/bin/bash -l

module load PyTorch

ray start --block --address=$1 

Thanks

Patol75 commented 3 years ago

I do not see any problem in these scripts, unfortunately. However, I still have two suggestions. Within your Python script, can you get rid of if __name__ == '__main__': and see if that helps? Additionally, can you try upgrading Ray to either 0.8.7 (python3 -m pip install --user --upgrade ray==0.8.7) or the latest nightly build (I linked it above, same procedure with pip)?

Lewisracing commented 3 years ago

Okay, here's an update... I managed to get help rewriting the .sub and .sh scripts.

submit script:

#!/bin/bash

#PBS -N pythoncpu_testray
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -q five_day
#PBS -m abe
#PBS -M aaa@aaa.ac.uk  
#PBS -j oe
#PBS -W sandbox=PRIVATE
#PBS -k n
##source $HOME/.bashrc
ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID
## Change to working directory
cd $PBS_O_WORKDIR
# Get node list from PBS and format for job:
JOB_NODES=`uniq -c ${PBS_NODEFILE} | awk -F. '{ print $1 }' | awk '{print $2 ":" $1}' | paste -s -d ':'`  
##

echo "pbs_o_workdir is: $PBS_O_WORKDIR"
thishost=`uname -n`
thishostip=`hostname -i`
echo "thishost=[$thishost]"
echo "thishostip=[$thishostip]"
suffix=':6379'
ip_head=$thishostip$suffix
red_password='5241590000000000'

module load PyTorch/1.6.0-foss-2019b-Python-3.7.4 
module list
ray start --head --port=6379 
sleep 5

##chmod +x startWorkerNode.sh
pbsdsh  $PBS_O_WORKDIR/startWorkerNode.sh "${ip_head}" "${red_password}"
## Tidy up the log directory
## DO NOT CHANGE THE LINE BELOW
## ============================

sleep 20

python <Main.py

rm $PBS_O_WORKDIR/$PBS_JOBID
#

and the startWorkerNode.sh :

#!/bin/bash -l
source $HOME/.bashrc

cd $PBS_O_WORKDIR

param1=$1
param2=$2

echo ${param1}
echo ${param2}

destnode=`uname -n`
echo "destnode is = [$destnode]"

module load PyTorch
ray start --address="${param1}" --redis-password="${param2}"

From the log file, you can see that the connection is successful. (I think it is....)

thishost=[dcn036.cluster]
thishostip=[10.0.1.36]

2020-11-27 12:48:54,207 INFO services.py:1092 -- View the Ray dashboard at http://localhost:8265
2020-11-27 12:48:52,905 INFO scripts.py:467 -- Local node IP: 10.0.1.36
2020-11-27 12:48:54,248 SUCC scripts.py:497 -- --------------------
2020-11-27 12:48:54,248 SUCC scripts.py:498 -- Ray runtime started.
2020-11-27 12:48:54,248 SUCC scripts.py:499 -- --------------------
2020-11-27 12:48:54,248 INFO scripts.py:501 -- Next steps
2020-11-27 12:48:54,248 INFO scripts.py:503 -- To connect to this Ray runtime from another node, run
2020-11-27 12:48:54,248 INFO scripts.py:507 --   ray start --address='10.0.1.36:6379' --redis-password='5241590000000000'
2020-11-27 12:48:54,248 INFO scripts.py:509 -- Alternatively, use the following Python code:
2020-11-27 12:48:54,249 INFO scripts.py:512 -- import ray
2020-11-27 12:48:54,249 INFO scripts.py:519 -- ray.init(address='auto', _redis_password='5241590000000000')
2020-11-27 12:48:54,249 INFO scripts.py:522 -- If connection fails, check your firewall settings and network configuration.
2020-11-27 12:48:54,249 INFO scripts.py:526 -- To terminate the Ray runtime, run
2020-11-27 12:48:54,249 INFO scripts.py:527 --   ray stop
10.0.1.36:6379
5241590000000000
10.0.1.36:6379
5241590000000000
destnode is = [dcn036.cluster]
10.0.1.36:6379
5241590000000000
10.0.1.36:6379
5241590000000000
destnode is = [dcn036.cluster]
destnode is = [dcn036.cluster]
destnode is = [dcn036.cluster]
10.0.1.36:6379
10.0.1.36:6379
5241590000000000
10.0.1.36:6379
10.0.1.36:6379
5241590000000000
5241590000000000
5241590000000000
destnode is = [dcn046.cluster]
destnode is = [dcn046.cluster]
destnode is = [dcn046.cluster]
destnode is = [dcn046.cluster]
2020-11-27 12:48:15,039 INFO scripts.py:591 -- Local node IP: 10.0.1.46
2020-11-27 12:48:15,070 SUCC scripts.py:606 -- --------------------
2020-11-27 12:48:15,071 SUCC scripts.py:607 -- Ray runtime started.
2020-11-27 12:48:15,071 SUCC scripts.py:608 -- -2020-11-27 12:48:15,039 INFO scripts.py:591 -- Local node IP: 10.0.1.46
2020-11-27 12:48:15,070 SUCC scripts.py:606 -- --------------------
2020-11-27 12:48:15,071 SUCC scripts.py:607 -- Ray runtime started.
2020-11-27 12:48:15,071 SUCC scripts.py:608 -- --------------------
2020-11-27 12:48:15,071 INFO scripts.py:610 -- To terminate the Ray runtime, run
2020-11-27 12:48:15,071 INFO scripts.py:611 --   ray stop
-------------------
2020-11-27 12:48:15,071 INFO scripts.py:610 -- To terminate the Ray runtime, run
2020-11-27 12:48:15,071 INFO scripts.py:611 --   ray stop
2020-11-27 12:48:15,040 INFO scripts.py:591 -- Local node IP: 10.0.1.46
2020-11-27 12:48:15,071 SUCC scripts.py:606 -- --------------------
2020-11-27 12:48:15,071 SUCC scripts.py:607 -- Ray runtime started.
2020-11-27 12:48:15,072 SUCC scripts.py:608 -- --------------------
2020-11-27 12:48:15,072 INFO scripts.py:610 -- To terminate the Ray runtime, run
2020-11-27 12:48:15,072 INFO scripts.py:611 --   ray stop
2020-11-27 12:48:15,107 INFO scripts.py:591 -- Local node IP: 10.0.1.46
2020-11-27 12:48:15,144 SUCC scripts.py:606 -- --------------------
2020-11-27 12:48:15,144 SUCC scripts.py:607 -- Ray runtime started.
2020-11-27 12:48:15,144 SUCC scripts.py:608 -- --------------------
2020-11-27 12:48:15,144 INFO scripts.py:610 -- To terminate the Ray runtime, run
2020-11-27 12:48:15,144 INFO scripts.py:611 --   ray stop
2020-11-27 12:49:06,809 INFO scripts.py:591 -- Local node IP: 10.0.1.36
2020-11-27 12:49:06,855 SUCC scripts.py:606 -- --------------------
2020-11-27 12:49:06,855 SUCC scripts.py:607 -- Ray runtime started.
2020-11-27 12:49:06,855 SUCC scripts.py:608 -- --------------------
2020-11-27 12:49:06,855 INFO scripts.py:610 -- To terminate the Ray runtime, run
2020-11-27 12:49:06,855 INFO scripts.py:611 --   ray stop
2020-11-27 12:49:06,809 INFO scripts.py:591 -- Local node IP: 10.0.1.36
2020-11-27 12:49:06,855 SUCC scripts.py:606 -- --------------------
2020-11-27 12:49:06,855 SUCC scripts.py:607 -- Ray runtime started.
2020-11-27 12:49:06,856 SUCC scripts.py:608 -- --------------------
2020-11-27 12:49:06,856 INFO scripts.py:610 -- To terminate the Ray runtime, run
2020-11-27 12:49:06,856 INFO scripts.py:611 --   ray stop
2020-11-27 12:49:06,809 INFO scripts.py:591 -- Local node IP: 10.0.1.36
2020-11-27 12:49:06,862 SUCC scripts.py:606 -- --------------------
2020-11-27 12:49:06,862 SUCC scripts.py:607 -- Ray runtime started.
2020-11-27 12:49:06,863 SUCC scripts.py:608 -- --------------------
2020-11-27 12:49:06,863 INFO scripts.py:610 -- To terminate the Ray runtime, run
2020-11-27 12:49:06,863 INFO scripts.py:611 --   ray stop
2020-11-27 12:49:06,809 INFO scripts.py:591 -- Local node IP: 10.0.1.36
2020-11-27 12:49:06,863 SUCC scripts.py:606 -- --------------------
2020-11-27 12:49:06,863 SUCC scripts.py:607 -- Ray runtime started.
2020-11-27 12:49:06,863 SUCC scripts.py:608 -- --------------------
2020-11-27 12:49:06,863 INFO scripts.py:610 -- To terminate the Ray runtime, run
2020-11-27 12:49:06,864 INFO scripts.py:611 --   ray stop
2020-11-27 12:49:36,541 INFO worker.py:651 -- Connecting to existing Ray cluster at address: 10.0.1.36:6379

But still.... the workers all work on the head node...

Lewisracing commented 3 years ago

@rkooo567 What could cause 'Ray runtime started' to be reported on different nodes while the workers still all run on the head node?

Patol75 commented 3 years ago

Hey @Lewisracing, thanks for the update. I would still suggest having --block in the ray start command of startWorkerNode.sh. Additionally, it should not be necessary to communicate with all cores through pbsdsh: a single core on each other node should be sufficient.

Lewisracing commented 3 years ago

Great news: the final solution, which works for Ray 1.0+. For the PBS cluster, we have one .sub script for job submission and one shell script to start the worker nodes. The scripts are as follows. The job.sub script:

#!/bin/bash

#PBS -N pythoncpu_testray
#PBS -l select=2:ncpus=10:mpiprocs=10
#PBS -q five_day
#PBS -m abe
#PBS -M xxx@xxx.xx  
#PBS -j oe
#PBS -W sandbox=PRIVATE
#PBS -k n

ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID

cd $PBS_O_WORKDIR

jobnodes=`uniq -c ${PBS_NODEFILE} | awk -F. '{print $1 }' | awk '{print $2}' | paste -s -d " "`

thishost=`uname -n | awk -F. '{print $1.}'`
thishostip=`hostname -i`
rayport=6379

thishostNport="${thishostip}:${rayport}"
echo "Allocate Nodes = <$jobnodes>"

echo "set up ray cluster..." 
for n in `echo ${jobnodes}`
do
        if [[ ${n} == "${thishost}" ]]
        then
                echo "first allocate node - use as headnode ..."
                module load PyTorch
                ray start --head
                sleep 5
        else
                ssh ${n}  $PBS_O_WORKDIR/startWorkerNode.sh ${thishostNport}
                sleep 10
        fi
done 

python <Main.py

rm $PBS_O_WORKDIR/$PBS_JOBID
#

The startWorkerNode.sh script:

#!/bin/bash -l
source $HOME/.bashrc
cd $PBS_O_WORKDIR
param1=$1
destnode=`uname -n`
echo "destnode is = [$destnode]"
module load PyTorch
ray start --address="${param1}" --redis-password='5241590000000000'

Note that for the PBS cluster I'm using, before submitting the .sub file, I need to go into the directory and run the chmod command on the .sh file:

chmod +x startWorkerNode.sh

I hope this is a general solution for everyone. I finally made it work with huge help from my uni's HPC specialist
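
As an optional sanity check (not part of the scripts above, just a sketch), a short driver snippet can confirm that both nodes joined the cluster; ray.nodes() returns one entry per connected node:

import ray

# Connect to the cluster that the job script started with `ray start --head`.
ray.init(address='auto', _redis_password='5241590000000000')

# With both nodes connected, two entries should be printed here.
for node in ray.nodes():
    print(node)
print('Number of nodes reported by Ray:', len(ray.nodes()))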

richardliaw commented 3 years ago

cc @amogkam

richardliaw commented 3 years ago

@Lewisracing that's awesome to hear!

Patol75 commented 3 years ago

@Lewisracing Glad you got it to work! Just one potential concern: is it possible that the first iteration of the loop targets a worker node rather than the head one?

Lewisracing commented 3 years ago

@Patol75 In this case, the first allocated node will always be taken as the head node. So... I'm wondering what would lead to the possibility you mentioned?

Patol75 commented 3 years ago

If that's the case, then no worries. :)

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 3 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!