Closed netw0rkf10w closed 4 years ago
cc @sumanthratna could you help take a look?
@netw0rkf10w thanks for this long write up! I'll try to answer a couple:
--num-cpus
is only accounting/an override value. It doesn't necessarily correspond with physical CPUs. You should set it to --num-cpus=3
which corresponds with --cpus-per-task=3
(assuming each machine in your cluster only has 3 CPUs).srun --nodes=1 --ntasks=8
I'm not quite sure what this does, but perhaps you should increase number of nodes too.Have you looked at https://docs.ray.io/en/master/cluster/slurm.html?
@richardliaw Thanks a lot for the prompt reply!
--num-cpus
is only accounting/an override value. It doesn't necessarily correspond with physical CPUs. You should set it to--num-cpus=3
which corresponds with--cpus-per-task=3
(assuming each machine in your cluster only has 3 CPUs).
I already tried --num-cpus=3
but then Tune had access to only 3 CPUs, which raised a resource error because for each trial I requested 8GPUs and 24 CPUs:
distributed_train = DistributedTrainableCreator(
partial(train_function, args=args),
use_gpu=True,
num_workers=8, # number of GPUs per trial
num_cpus_per_worker=3,
backend="nccl",
timeout_s=60)
srun --nodes=1 --ntasks=8
I'm not quite sure what this does, but perhaps you should increase number of nodes too.
Well, on my Slurm server, I was instructed that --ntasks
is the total number of GPUs requested (thus: --ntasks=8
means 1 node and --ntasks=16
means 2 nodes).
Have you looked at https://docs.ray.io/en/master/cluster/slurm.html?
Actually that was the first thing I looked at, and it took me a while to adapt the example to my use case. In that example, each node has 1 task, but in my case --ntasks
corresponds to the number of GPUs (i.e. 8 per node). At least this is what I have been doing for PyTorch distributed trainings (8 GPUs per node, so 8 tasks per node). Let me try packing the 8 GPUs into a single to task to see if it works...
I tried 1 single task with 24 CPUs:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=24
...
# Start the worker nodes
srun --nodes=1 --ntasks=1 --cpus-per-task=24 --exclude=`hostname` ~/.local/bin/ray start --address $ip_head --block --num-cpus 24 &
sleep 5
but it didn't work:
sbatch: error: CPU count per node can not be satisfied sbatch: error: Batch job submission failed: Requested node configuration is not available
A clarification just in case: I managed to get Tune working on a single node, except that I'm not sure if the trials used all the 24 CPUs available on the node, or if they shared the same 3 CPUs between them because I launched Ray using
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24
(It's late in Europe, I'll get back in 8-10 hours. Thanks.)
@netw0rkf10w can you post the output of the above Tune run with:
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24
Can you also try:
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` python -c 'import os; print(os.cpu_count())'
# my hypothesis is that it will return 24.
From what I understand, here is what it seems like:
srun
with ~/.local/bin/ray start --head
srun
- this is only relevant for something like pytorch DDP, which requires you to run the same script on each GPU. You do not need to do this for Ray.Another followup question:
#!/bin/bash
#SBATCH --ntasks=2 # total number of tasks
#SBATCH --ntasks-per-node=1 # number of tasks per node
#SBATCH --gres=gpu:8 # number of GPUs per node
#SBATCH --cpus-per-task=3 # number of cores per node <-------------- can you increase this? or is there a system limit?
Hi @richardliaw !
@netw0rkf10w can you post the output of the above Tune run with:
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24
The logs are shown below where I set 1 GPU for each trial (the logs is for --num-cpus 25
but for --num-cpus 24
they are similar):
2020-09-24 01:41:45,662 INFO scripts.py:465 -- Using IP address 10.148.8.10 for this node. 2020-09-24 01:41:45,663 INFO resource_spec.py:231 -- Starting Ray with 544.19 GiB memory available for workers and up to 186.26 GiB for objects. You can adjust these settings with ray.init(memory=
, object_store_memory= ). 2020-09-24 01:41:46,343 INFO services.py:1193 -- View the Ray dashboard at 10.148.8.10:8265 2020-09-24 01:41:46,399 INFO scripts.py:495 -- Started Ray on this node. You can add additional nodes to the cluster by calling ray start --address='10.148.8.10:6379' --redis-password='5241590000000000'
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray ray.init(address='auto', redis_password='5241590000000000')
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
WARNING: Logging before InitGoogleLogging() is written to STDERR I0924 01:41:52.687922 8548 8548 global_state_accessor.cc:25] Redis server address = 10.148.8.10:6379, is test flag = 0 I0924 01:41:52.698506 8548 8548 redis_client.cc:146] RedisClient connected. I0924 01:41:52.706383 8548 8548 redis_gcs_client.cc:89] RedisGcsClient Connected. I0924 01:41:52.707314 8548 8548 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.8.10:34015 I0924 01:41:52.707515 8548 8548 service_based_accessor.cc:92] Reestablishing subscription for job info. I0924 01:41:52.707527 8548 8548 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0924 01:41:52.707535 8548 8548 service_based_accessor.cc:797] Reestablishing subscription for node info. I0924 01:41:52.707541 8548 8548 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0924 01:41:52.707547 8548 8548 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0924 01:41:52.707553 8548 8548 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0924 01:41:52.707561 8548 8548 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. Nodes in the Ray cluster: [{'NodeID': 'd74aff9bad58c110345691f5b9697174cc2372d6', 'Alive': True, 'NodeManagerAddress': '10.148.8.10', 'NodeManagerHostname': 'node1', 'NodeManagerPort': 61334, 'ObjectManagerPort': 38197, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-24_01-41-45_662966_8535/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-24_01-41-45_662966_8535/sockets/raylet', 'alive': True, 'Resources': {'GPUType:V100': 1.0, 'memory': 11145.0, 'CPU': 25.0, 'node:10.148.8.10': 1.0, 'object_store_memory': 2632.0, 'GPU': 8.0}}] == Status == Memory usage on this node: 24.4/754.3 GiB Using AsyncHyperBand: num_stopped=0 Bracket: Iter 2.000: None | Iter 1.000: None Resources requested: 3/25 CPUs, 1/8 GPUs, 0.0/544.19 GiB heap, 0.0/128.52 GiB objects (0/1.0 GPUType:V100) Result logdir:
/ray_results/WrappedDistributedTorchTrainable Number of trials: 16 (15 PENDING, 1 RUNNING) +----------------------------------------------+----------+-------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+ | Trial name | status | loc | alpha | beta | bilateral_weight | blur | gamma | spatial_weight | steps | window | |----------------------------------------------+----------+-------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------| | WrappedDistributedTorchTrainable_5b88f_00000 | RUNNING | | 1.00825 | 0.0102361 | 260.833 | 2 | 33.811 | 20.5276 | 2 | 11 | | WrappedDistributedTorchTrainable_5b88f_00001 | PENDING | | 50.2677 | 0.108875 | 282.037 | 4 | 240.289 | 205.205 | 3 | 11 | | WrappedDistributedTorchTrainable_5b88f_00002 | PENDING | | 333.117 | 0.0323349 | 9.59502 | 1 | 198.175 | 600.021 | 3 | 11 | | WrappedDistributedTorchTrainable_5b88f_00003 | PENDING | | 4.48639 | 0.215188 | 14.553 | 1 | 49.9334 | 9.37293 | 2 | 9 | | WrappedDistributedTorchTrainable_5b88f_00004 | PENDING | | 43.2839 | 0.0010359 | 218.981 | 4 | 73.9295 | 1.58913 | 2 | 11 | | WrappedDistributedTorchTrainable_5b88f_00005 | PENDING | | 148.056 | 0.369112 | 14.5907 | 1 | 46.9408 | 2.32398 | 3 | 11 | | WrappedDistributedTorchTrainable_5b88f_00006 | PENDING | | 24.3175 | 0.724971 | 24.9273 | 1 | 45.138 | 1.21104 | 1 | 9 | | WrappedDistributedTorchTrainable_5b88f_00007 | PENDING | | 5.86012 | 0.00181386 | 7.42448 | 1 | 137.955 | 87.5108 | 1 | 11 | | WrappedDistributedTorchTrainable_5b88f_00008 | PENDING | | 1.20372 | 0.00784292 | 108.812 | 2 | 422.599 | 27.7823 | 3 | 9 | | WrappedDistributedTorchTrainable_5b88f_00009 | PENDING | | 8.98295 | 0.00280868 | 10.5628 | 4 | 16.9054 | 116.381 | 3 | 11 | | WrappedDistributedTorchTrainable_5b88f_00010 | PENDING | | 59.0022 | 0.312436 | 32.4772 | 4 | 5.07771 | 6.54218 | 3 | 9 | | WrappedDistributedTorchTrainable_5b88f_00011 | PENDING | | 50.4253 | 0.00763172 | 6.23578 | 4 | 293.568 | 136.231 | 2 | 11 | | WrappedDistributedTorchTrainable_5b88f_00012 | PENDING | | 35.6073 | 0.186986 | 101.348 | 2 | 2.25512 | 5.87278 | 1 | 11 | | WrappedDistributedTorchTrainable_5b88f_00013 | PENDING | | 763.593 | 0.856251 | 319.073 | 4 | 1.27139 | 2.50655 | 3 | 7 | | WrappedDistributedTorchTrainable_5b88f_00014 | PENDING | | 22.8901 | 0.00167951 | 327.988 | 1 | 399.009 | 57.7956 | 2 | 11 | | WrappedDistributedTorchTrainable_5b88f_00015 | PENDING | | 1.41336 | 0.0258321 | 1.33932 | 1 | 143.696 | 137.824 | 2 | 11 | +----------------------------------------------+----------+-------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+ Saving checkpoint to /ray_results/WrappedDistributedTorchTrainable/WrappedDistributedTorchTrainable_6_alpha=24.318,beta=0.72497,bilateral_weight=24.927,blur=1,gamma=45.138,spatial_weight=1.211,step_2020-09-24_01-41-53nnki37bt/worker_0/checkpoint_0/checkpoint.pth Result for WrappedDistributedTorchTrainable_5b88f_00006: accuracy: 0.2075007382104213 date: 2020-09-24_01-45-42 done: false experiment_id: 1555c7fa0aba4af1a036bcc6d927a2fa experiment_tag: 6_alpha=24.318,beta=0.72497,bilateral_weight=24.927,blur=1,gamma=45.138,spatial_weight=1.211,steps=1,window=9 hostname: jean-zay-ia810 iterations_since_restore: 1 loss: 5.2999533758163455 node_ip: 10.148.8.10 pid: 8676 should_checkpoint: true time_since_restore: 218.52256321907043 time_this_iter_s: 218.52256321907043 time_total_s: 218.52256321907043 timestamp: 1600904742 timesteps_since_restore: 0 training_iteration: 1 trial_id: 5b88f_00006 == Status == Memory usage on this node: 48.9/754.3 GiB Using AsyncHyperBand: num_stopped=0 Bracket: Iter 2.000: None | Iter 1.000: 0.2075007382104213 Resources requested: 24/25 CPUs, 8/8 GPUs, 0.0/544.19 GiB heap, 0.0/128.52 GiB objects (0/1.0 GPUType:V100) Result logdir:
/ray_results/WrappedDistributedTorchTrainable Number of trials: 16 (8 PENDING, 8 RUNNING) +----------------------------------------------+----------+------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------+ | Trial name | status | loc | alpha | beta | bilateral_weight | blur | gamma | spatial_weight | steps | window | loss | accuracy | training_iteration | |----------------------------------------------+----------+------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------| | WrappedDistributedTorchTrainable_5b88f_00000 | RUNNING | | 1.00825 | 0.0102361 | 260.833 | 2 | 33.811 | 20.5276 | 2 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00001 | RUNNING | | 50.2677 | 0.108875 | 282.037 | 4 | 240.289 | 205.205 | 3 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00002 | RUNNING | | 333.117 | 0.0323349 | 9.59502 | 1 | 198.175 | 600.021 | 3 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00003 | RUNNING | | 4.48639 | 0.215188 | 14.553 | 1 | 49.9334 | 9.37293 | 2 | 9 | | | | | WrappedDistributedTorchTrainable_5b88f_00004 | RUNNING | | 43.2839 | 0.0010359 | 218.981 | 4 | 73.9295 | 1.58913 | 2 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00005 | RUNNING | | 148.056 | 0.369112 | 14.5907 | 1 | 46.9408 | 2.32398 | 3 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00006 | RUNNING | 10.148.8.10:8676 | 24.3175 | 0.724971 | 24.9273 | 1 | 45.138 | 1.21104 | 1 | 9 | 5.29995 | 0.207501 | 1 | | WrappedDistributedTorchTrainable_5b88f_00007 | RUNNING | | 5.86012 | 0.00181386 | 7.42448 | 1 | 137.955 | 87.5108 | 1 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00008 | PENDING | | 1.20372 | 0.00784292 | 108.812 | 2 | 422.599 | 27.7823 | 3 | 9 | | | | | WrappedDistributedTorchTrainable_5b88f_00009 | PENDING | | 8.98295 | 0.00280868 | 10.5628 | 4 | 16.9054 | 116.381 | 3 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00010 | PENDING | | 59.0022 | 0.312436 | 32.4772 | 4 | 5.07771 | 6.54218 | 3 | 9 | | | | | WrappedDistributedTorchTrainable_5b88f_00011 | PENDING | | 50.4253 | 0.00763172 | 6.23578 | 4 | 293.568 | 136.231 | 2 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00012 | PENDING | | 35.6073 | 0.186986 | 101.348 | 2 | 2.25512 | 5.87278 | 1 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00013 | PENDING | | 763.593 | 0.856251 | 319.073 | 4 | 1.27139 | 2.50655 | 3 | 7 | | | | | WrappedDistributedTorchTrainable_5b88f_00014 | PENDING | | 22.8901 | 0.00167951 | 327.988 | 1 | 399.009 | 57.7956 | 2 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00015 | PENDING | | 1.41336 | 0.0258321 | 1.33932 | 1 | 143.696 | 137.824 | 2 | 11 | | | | +----------------------------------------------+----------+------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------+ Can you also try:
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` python -c 'import os; print(os.cpu_count())' # my hypothesis is that it will return 24.
I don't know why but I always have to put the full path to binaries in srun
, otherwise I will obtain an error:
echo "Path to python:"
which python
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` python -c 'import os; print(os.cpu_count())'
Path to python: /somepath/anaconda-py3/2019.10/bin/python slurmstepd: error: execve(): python: No such file or directory
The output of
python_bin=$(which python)
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ${python_bin} -c 'import os; print(os.cpu_count())'
is 48 (I guess there are 24 cores, each has 2 threads). So your hypothesis is true, Python can access to all the CPUs.
From what I understand, here is what it seems like:
- You should submit a 1 node, 1 task
srun
with~/.local/bin/ray start --head
- You should NOT submit 1 node, 8 task
srun
- this is only relevant for something like pytorch DDP, which requires you to run the same script on each GPU. You do not need to do this for Ray.- You should perhaps reduce the CPU resource requirements for your Trial (in python script). It's actually just for accounting purposes and won't affect runtime.
Regarding (1) and (2): my conclusion as well (in case (2) Ray is started 8 times, in other words there are 8 Ray servers running at the same time).
Regarding (3): Do you mean that setting num_cpus_per_worker
in DistributedTrainableCreator
doesn't affect the training? That's strange... I thought num_cpus_per_worker
corresponds to the number of CPUs assigned to each PyTorch process (my PyTorch data loaders use 3 CPUs to process the data).
Another followup question:
#!/bin/bash #SBATCH --ntasks=2 # total number of tasks #SBATCH --ntasks-per-node=1 # number of tasks per node #SBATCH --gres=gpu:8 # number of GPUs per node #SBATCH --cpus-per-task=3 # number of cores per node <-------------- can you increase this? or is there a system limit?
Indeed, 3 is the limit of --cpus-per-task
. Using a higher value, I got sbatch: error: Batch job submission failed: Requested node configuration is not available
.
In conclusion, it seems that Ray Tune works correctly with the first configuration in my first post.
Sounds good. Maybe you now want to try using multiple nodes for your Ray training script?
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24
# See example for how to do this right
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --address $ip_head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24
@richardliaw Thanks. I adapted the tutorial in the master branch to my use case but haven't been able to get it working yet.
Consider the following scenario:
First step, allocate 3 nodes and charge necessary models:
#!/bin/bash
#SBATCH --ntasks=12 # total number of GPUs
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --gres=gpu:4 # number of GPUs per node
#SBATCH --cpus-per-task=10 # number of cores per node
module load python/3.7.5 cuda/10.1.2 cudnn/7.6.5.32-cuda-10.1 nccl/2.6.4-1-cuda gcc/7.3.0 openmpi/4.0.2-cuda
Second step, start Ray head node:
suffix='6379'
ip_head=`hostname`:$suffix
export ip_head # Exporting for latter access by trainer.py
# Start Ray head node
ray_bin=$(which ray)
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=`hostname` ${ray_bin} start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 40 &
sleep 5
Note: I set --cpus-per-task=1
because I thought the Ray server doesn't need to use a lot of CPUs, so it'd be better to save them for PyTorch, but please correct me if I'm wrong.)
Third step, start the two remaining Ray worker nodes:
# Start Ray worker nodes
srun --nodes=2 --ntasks=2 --ntasks-per-node=1 --cpus-per-task=1 --exclude=`hostname` ${ray_bin} start --address $ip_head --block --num-cpus 40 &
sleep 5
This command start a Ray server on each node.
Final step, launch Tune's Python script, each trial will use 2 GPUs and 10 CPUs:
OMP_NUM_THREADS=10 python -u main_tune.py --cfg configs/resnet50_coco.cfg --num_workers 10 --ngpus_per_trial 2
Result: it didn't work. The error was
Check failed: assigned_port != -1 Failed to allocate a port for the worker. Please specify a wider port range using the '--min-worker-port' and '--max-worker-port' arguments to 'ray start'.
According to the logs (appended to the end of this post), all the 3 Ray servers were well started on 3 different nodes.
What did I do wrong?
Thank you again for your help!
2020-09-25 00:08:25,724 INFO scripts.py:465 -- Using IP address 10.159.40.14 for this node. 2020-09-25 00:08:25,726 INFO resource_spec.py:231 -- Starting Ray with 112.21 GiB memory available for workers and up to 52.1 GiB for objects. You can adjust these settings with ray.init(memory=
, object_store_memory= ). 2020-09-25 00:08:27,609 INFO services.py:1193 -- View the Ray dashboard at 10.159.40.14:8265 2020-09-25 00:08:27,852 INFO scripts.py:495 -- Started Ray on this node. You can add additional nodes to the cluster by calling ray start --address='10.159.40.14:6379' --redis-password='5241590000000000'
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray ray.init(address='auto', redis_password='5241590000000000')
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
Number of Ray workers (exclusing the head): 2 2020-09-25 00:08:29,098 INFO scripts.py:542 -- Using IP address 10.148.8.149 for this node. 2020-09-25 00:08:29,101 INFO resource_spec.py:231 -- Starting Ray with 121.53 GiB memory available for workers and up to 52.1 GiB for objects. You can adjust these settings with ray.init(memory=
, object_store_memory= ). 2020-09-25 00:08:29,139 INFO scripts.py:542 -- Using IP address 10.148.8.148 for this node. 2020-09-25 00:08:29,163 INFO scripts.py:551 -- Started Ray on this node. If you wish to terminate the processes that have been started, run ray stop
2020-09-25 00:08:29,142 INFO resource_spec.py:231 -- Starting Ray with 121.53 GiB memory available for workers and up to 52.1 GiB for objects. You can adjust these settings with ray.init(memory=
, object_store_memory= ). 2020-09-25 00:08:29,196 INFO scripts.py:551 -- Started Ray on this node. If you wish to terminate the processes that have been started, run ray stop
WARNING: Logging before InitGoogleLogging() is written to STDERR I0925 00:09:06.578213 31465 31465 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 00:09:06.579907 31465 31465 redis_client.cc:146] RedisClient connected. I0925 00:09:06.588017 31465 31465 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 00:09:06.589426 31465 31465 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:41451 I0925 00:09:06.589785 31465 31465 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 00:09:06.589805 31465 31465 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 00:09:06.589818 31465 31465 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 00:09:06.589831 31465 31465 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 00:09:06.589841 31465 31465 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 00:09:06.589854 31465 31465 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 00:09:06.589869 31465 31465 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. Nodes in the Ray cluster: [{'NodeID': '78d43a5c8a3edb926e35128f9b0bb68e4217ccb1', 'Alive': True, 'NodeManagerAddress': '10.159.40.14', 'NodeManagerHostname': 'r11i4n5', 'NodeManagerPort': 51609, 'ObjectManagerPort': 37797, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/raylet', 'alive': True, 'Resources': {'memory': 2298.0, 'node:10.159.40.14': 1.0, 'CPU': 120.0, 'object_store_memory': 736.0, 'GPU': 4.0, 'GPUType:V100': 1.0}}, {'NodeID': 'ace3bd94d72a8e30eda055c6097d364fb5d5be65', 'Alive': True, 'NodeManagerAddress': '10.148.8.149', 'NodeManagerHostname': 'r11i4n7', 'NodeManagerPort': 54065, 'ObjectManagerPort': 44949, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/raylet', 'alive': True, 'Resources': {'node:10.148.8.149': 1.0, 'object_store_memory': 736.0, 'GPU': 4.0, 'GPUType:V100': 1.0, 'memory': 2489.0, 'CPU': 120.0}}, {'NodeID': '617df6676ae7618caea3db9db555a50df94cf865', 'Alive': True, 'NodeManagerAddress': '10.148.8.148', 'NodeManagerHostname': 'r11i4n6', 'NodeManagerPort': 43854, 'ObjectManagerPort': 39145, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/raylet', 'alive': True, 'Resources': {'object_store_memory': 736.0, 'GPU': 4.0, 'GPUType:V100': 1.0, 'memory': 2489.0, 'CPU': 120.0, 'node:10.148.8.148': 1.0}}] (pid=32152, ip=10.159.40.64) F0925 00:09:06.604905 32152 32152 core_worker.cc:317] Check failed: assigned_port != -1 Failed to allocate a port for the worker. Please specify a wider port range using the '--min-worker-port' and '--max-worker-port' arguments to 'ray start'. (pid=32152, ip=10.159.40.64) Check failure stack trace: (pid=32152, ip=10.159.40.64) @ 0x14c01a1a9edd google::LogMessage::Fail() ...
In a previous latest pip release's version of the tutorial, the worker nodes were started using a for loop. Maybe I can use that and start each node with a different port...
Some new strange behaviors...
I changed my script to start the worker nodes one by one using a for loop:
hostname=`hostname`
suffix='6379'
ip_head=${hostname}:$suffix
export ip_head # Exporting for latter access by trainer.py
ray_bin=$(which ray)
# Start Ray head node
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${hostname} ${ray_bin} start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 40 &
sleep 5
# Start Ray worker nodes: those different from the host
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
for (( i=0; i<${SLURM_JOB_NUM_NODES}; i++ ))
do
node=${nodes_array[$i]}
if [[ "${node}" != "${hostname}" ]] ;then
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} start --address $ip_head --block --num-cpus 40 &
sleep 5
fi
done
It seemed to work:
I0925 01:33:36.487109 35744 35744 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:33:36.489140 35744 35744 redis_client.cc:146] RedisClient connected. I0925 01:33:36.497191 35744 35744 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:33:36.498423 35744 35744 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:40417 I0925 01:33:36.498684 35744 35744 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:33:36.498700 35744 35744 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:33:36.498710 35744 35744 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:33:36.498719 35744 35744 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:33:36.498728 35744 35744 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:33:36.498736 35744 35744 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:33:36.498746 35744 35744 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. Nodes in the Ray cluster: [{'NodeID': 'decb0e6e7c8d949a1b543e284677743cd96a8d59', 'Alive': True, 'NodeManagerAddress': '10.148.8.148', 'NodeManagerHostname': 'r11i4n6', 'NodeManagerPort': 64977, 'ObjectManagerPort': 37009, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/raylet', 'alive': True, 'Resources': {'GPU': 4.0, 'GPUType:V100': 1.0, 'memory': 2489.0, 'CPU': 120.0, 'node:10.148.8.148': 1.0, 'object_store_memory': 736.0}}, {'NodeID': '9c663cff56c1085c6cb8768cc940afdd9a51f8a9', 'Alive': True, 'NodeManagerAddress': '10.159.40.14', 'NodeManagerHostname': 'r11i4n5', 'NodeManagerPort': 59819, 'ObjectManagerPort': 46265, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/raylet', 'alive': True, 'Resources': {'node:10.159.40.14': 1.0, 'CPU': 120.0, 'object_store_memory': 735.0, 'GPU': 4.0, 'GPUType:V100': 1.0, 'memory': 2298.0}}, {'NodeID': '11e5439845c7f9736bc71fdf7101eaf2b1a4c9e3', 'Alive': True, 'NodeManagerAddress': '10.148.8.149', 'NodeManagerHostname': 'r11i4n7', 'NodeManagerPort': 54733, 'ObjectManagerPort': 33525, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/raylet', 'alive': True, 'Resources': {'memory': 2489.0, 'CPU': 120.0, 'node:10.148.8.149': 1.0, 'object_store_memory': 736.0, 'GPU': 4.0, 'GPUType:V100': 1.0}}]
But then I had a typo in the Python code so it stopped. I fixed the typo, then launched it again, but it didn't work this time:
I0925 01:39:04.178894 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:04.180613 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:04.188684 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:04.189755 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:04.190002 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:04.190018 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:04.190027 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:04.190037 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:04.190044 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:04.190053 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:04.190063 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:04,191 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:05.191946 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:05.193334 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:05.193362 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:05.219499 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:05.219703 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:05.219717 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:05.219723 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:05.219730 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:05.219738 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:05.219744 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:05.219753 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:05,227 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:06.229421 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:06.231026 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:06.231053 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:06.231572 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:06.231770 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:06.231782 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:06.231791 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:06.231796 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:06.231803 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:06.231811 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:06.231820 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:06,232 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:07.234253 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:07.235419 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:07.235446 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:07.235951 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:07.236145 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:07.236157 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:07.236165 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:07.236171 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:07.236178 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:07.236186 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:07.236194 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:07,237 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:08.238590 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:08.242529 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:08.242556 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:08.243067 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:08.243264 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:08.243468 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:08.243477 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:08.243484 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:08.243491 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:08.243499 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:08.243507 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:08,244 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:09.245899 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:09.257231 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:09.257258 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:09.257776 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:09.257969 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:09.257982 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:09.257989 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:09.257995 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:09.258002 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:09.258009 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:09.258019 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. Traceback (most recent call last): File "main_tune.py", line 316, in
main(args) File "main_tune.py", line 267, in main ray.init(address=os.environ["ip_head"]) File "/home/user/.local/lib/python3.7/site-packages/ray/worker.py", line 798, in init connect_only=True) File "/home/user/.local/lib/python3.7/site-packages/ray/node.py", line 155, in init redis_password=self.redis_password) File "/home/user/.local/lib/python3.7/site-packages/ray/services.py", line 212, in get_address_info_from_redis redis_address, node_ip_address, redis_password=redis_password) File "/home/user/.local/lib/python3.7/site-packages/ray/services.py", line 183, in get_address_info_from_redis_helper "Redis has started but no raylets have registered yet.") RuntimeError: Redis has started but no raylets have registered yet.
Arrrrrrr...
Hmmm maybe try doing ray stop &&
before ray start
to make sure there's no zombie processes from previous runs?
for (( i=0; i<${SLURM_JOB_NUM_NODES}; i++ ))
do
node=${nodes_array[$i]}
if [[ "${node}" != "${hostname}" ]] ;then
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} stop && ${ray_bin} start --address $ip_head --block --num-cpus 40 &
sleep 5
fi
done
@richardliaw Even worse :((
2020-09-25 02:54:50,475 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? 2020-09-25 02:54:51,115 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 02:54:51,116 ERROR scripts.py:563 -- raylet died with exit code -6 2020-09-25 02:54:51,116 ERROR scripts.py:565 -- Killing remaining processes and exiting... 2020-09-25 02:54:51,458 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 02:54:51,459 ERROR scripts.py:563 -- log_monitor died with exit code 1 2020-09-25 02:54:51,459 ERROR scripts.py:565 -- Killing remaining processes and exiting... I0925 02:54:51.477274 38565 38565 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 W0925 02:54:51.477411 38565 38565 redis_context.cc:305] Failed to connect to Redis, retrying. W0925 02:54:51.577510 38565 38565 redis_context.cc:305] Failed to connect to Redis, retrying.
~Is it possible that multiple worker tasks are being placed on the same node?~
Maybe also try ray stop --force
with that command. I think this is really close though.
@richardliaw Still same Ray processes died unexpectedly
error.
let "total_cores=${SLURM_NTASKS} * ${SLURM_CPUS_PER_TASK}"
echo "Total cores = ${total_cores}"
hostname=`hostname`
suffix='6379'
ip_head=${hostname}:$suffix
export ip_head # Exporting for latter access by trainer.py
ray_bin=$(which ray)
# Start Ray head node
echo "******************* Starting Ray head node on ${hostname}"
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${hostname} ${ray_bin} start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus ${total_cores} &
sleep 5
# Start Ray worker nodes
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
for (( i=0; i<${SLURM_JOB_NUM_NODES}; i++ ))
do
node=${nodes_array[$i]}
if [[ "${node}" != "${hostname}" ]] ;then
port_min=$((6380 + i*10))
port_max=$((6380 + i*10 + 9))
echo "******************* Starting Ray worker node on ${node}, port_min = ${port_min}, port_max = ${port_max}"
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} stop --force && ${ray_bin} start --address $ip_head --min-worker-port=${port_min} --max-worker-port=${port_max} --block --num-cpus ${total_cores} &
sleep 5
fi
done
The full logs are appended to the end of this message. Here are some highlights:
Starting Ray head node on r10i2n1 Starting Ray worker node on r10i2n2, port_min = 6390, port_max = 6399 *** Starting Ray worker node on r10i2n3, port_min = 6400, port_max = 6409 2020-09-25 16:06:02,880 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Full logs:
*** Starting Ray head node on r10i2n1 2020-09-25 16:04:43,819 INFO scripts.py:465 -- Using IP address 10.159.36.15 for this node. 2020-09-25 16:04:43,821 INFO resource_spec.py:231 -- Starting Ray with 112.11 GiB memory available for workers and up to 52.04 GiB for objects. You can adjust these settings with ray.init(memory=
, object_store_memory= ). 2020-09-25 16:04:44,277 INFO services.py:1193 -- View the Ray dashboard at 10.159.36.15:8265 2020-09-25 16:04:44,476 INFO scripts.py:495 -- Started Ray on this node. You can add additional nodes to the cluster by calling ray start --address='10.159.36.15:6379' --redis-password='5241590000000000'
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray ray.init(address='auto', redis_password='5241590000000000')
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
Number of Ray workers (excluding the head): 2 *** Starting Ray worker node on r10i2n2, port_min = 6390, port_max = 6399 2020-09-25 16:04:49,292 INFO scripts.py:542 -- Using IP address 10.148.3.221 for this node. 2020-09-25 16:04:49,298 INFO resource_spec.py:231 -- Starting Ray with 121.04 GiB memory available for workers and up to 51.89 GiB for objects. You can adjust these settings with ray.init(memory=
, object_store_memory= ). 2020-09-25 16:04:49,323 INFO scripts.py:551 -- Started Ray on this node. If you wish to terminate the processes that have been started, run ray stop
*** Starting Ray worker node on r10i2n3, port_min = 6400, port_max = 6409 Number of PyTorch data workers: 4
- OMP_NUM_THREADS=4
python -u main_tune.py --cfg configs/resnet50_coco.cfg --num_workers 4 --ngpus_per_trial 2 2020-09-25 16:05:26,359 INFO scripts.py:542 -- Using IP address 10.148.3.221 for this node. 2020-09-25 16:05:26,374 INFO resource_spec.py:231 -- Starting Ray with 119.09 GiB memory available for workers and up to 51.05 GiB for objects. You can adjust these settings with ray.init(memory=
, object_store_memory= ). 2020-09-25 16:05:26,403 INFO scripts.py:551 -- Started Ray on this node. If you wish to terminate the processes that have been started, run ray stop WARNING: Logging before InitGoogleLogging() is written to STDERR I0925 16:06:02.865236 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:02.869038 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:02.877063 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 16:06:02.878094 53397 53397 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.3.221:42283 I0925 16:06:02.878341 53397 53397 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 16:06:02.878356 53397 53397 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 16:06:02.878367 53397 53397 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 16:06:02.878376 53397 53397 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 16:06:02.878383 53397 53397 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 16:06:02.878392 53397 53397 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 16:06:02.878402 53397 53397 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 16:06:02,880 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 16:06:03.882337 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:03.891283 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:03.891319 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 16:06:03.892272 53397 53397 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.3.221:42283 I0925 16:06:03.892483 53397 53397 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 16:06:03.892496 53397 53397 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 16:06:03.892503 53397 53397 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 16:06:03.892510 53397 53397 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 16:06:03.892519 53397 53397 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 16:06:03.892526 53397 53397 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 16:06:03.892535 53397 53397 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 16:06:03,894 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 16:06:04.901077 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:04.902886 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:04.902923 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 16:06:04.903548 53397 53397 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.3.221:42283 I0925 16:06:04.903774 53397 53397 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 16:06:04.903789 53397 53397 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 16:06:04.903797 53397 53397 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 16:06:04.903805 53397 53397 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 16:06:04.903812 53397 53397 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 16:06:04.903820 53397 53397 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 16:06:04.903829 53397 53397 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 16:06:04,905 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 16:06:05.906332 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:05.907965 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:05.908004 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 16:06:05.909677 53397 53397 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.3.221:42283 I0925 16:06:05.909907 53397 53397 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 16:06:05.909921 53397 53397 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 16:06:05.909929 53397 53397 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 16:06:05.909935 53397 53397 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 16:06:05.909942 53397 53397 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 16:06:05.909950 53397 53397 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 16:06:05.909960 53397 53397 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 16:06:05,911 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? 2020-09-25 16:06:06,875 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 16:06:06,875 ERROR scripts.py:563 -- raylet died with exit code -6 2020-09-25 16:06:06,875 ERROR scripts.py:565 -- Killing remaining processes and exiting... I0925 16:06:06.913344 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:06.913882 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:06.913913 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. 2020-09-25 16:06:07,400 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 16:06:07,401 ERROR scripts.py:563 -- log_monitor died with exit code 1 2020-09-25 16:06:07,401 ERROR scripts.py:565 -- Killing remaining processes and exiting... 2020-09-25 16:06:07,444 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 16:06:07,445 ERROR scripts.py:563 -- log_monitor died with exit code 1 2020-09-25 16:06:07,445 ERROR scripts.py:565 -- Killing remaining processes and exiting... srun: error: r10i2n1: task 0: Exited with exit code 1 srun: Terminating job step 377044.0
I add a sleep 60
every time I do ray start
or ray stop
:
for (( i=0; i<${SLURM_JOB_NUM_NODES}; i++ ))
do
node=${nodes_array[$i]}
if [[ "${node}" != "${hostname}" ]] ;then
port_min=$((6380 + i*10))
port_max=$((6380 + i*10 + 9))
echo "******************* Starting Ray worker node on ${node}, port_min = ${port_min}, port_max = ${port_max}"
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} stop --force && sleep 60 && ${ray_bin} start --address $ip_head --min-worker-port=${port_min} --max-worker-port=${port_max} --block --num-cpus ${total_cores} &
sleep 60
fi
done
sleep 60
and it seems to work!!
I obtained a strange warning though:
2020-09-25 16:31:04,448 WARNING worker.py:1134 -- The actor or task with ID ffffffffffffffff55c3b2b60100 is pending and cannot currently be scheduled. It requires {} for execution and {} for placement, but this node only has remaining {node:10.159.36.15: 1.000000}, {GPUType:V100: 1.000000}, {CPU: 120.000000}, {memory: 112.011719 GiB}, {GPU: 4.000000}, {object_store_memory: 35.839844 GiB}. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
At some point I can see that there are 6 trials running at the same time, so it seems to work:
Resources requested: 48/360 CPUs, 12/12 GPUs, 0.0/352.98 GiB heap, 0.0/107.03 GiB objects (0/3.0 GPUType:V100)
Result logdir: /home/user/ray_results/resnet50_coco/WrappedDistributedTorchTrainable
Number of trials: 12 (6 PENDING, 6 RUNNING)
+----------------------------------------------+----------+--------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------+
| Trial name | status | loc | alpha | beta | bilateral_weight | blur | gamma | spatial_weight | steps | window | loss | accuracy | training_iteration |
|----------------------------------------------+----------+--------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------|
| WrappedDistributedTorchTrainable_bdb17_00000 | RUNNING | | 600.387 | 0.494662 | 258.554 | 1 | 237.326 | 20.9554 | 3 | 9 | | | |
| WrappedDistributedTorchTrainable_bdb17_00001 | RUNNING | 10.148.3.221:68850 | 12.9375 | 0.00255001 | 248.646 | 4 | 3.42105 | 178.76 | 3 | 7 | 5.42031 | 0.205922 | 1 |
| WrappedDistributedTorchTrainable_bdb17_00002 | RUNNING | | 1.99624 | 0.846555 | 4.908 | 2 | 1.33116 | 907.723 | 3 | 7 | | | |
| WrappedDistributedTorchTrainable_bdb17_00003 | RUNNING | | 76.9 | 0.26823 | 9.33501 | 2 | 5.17456 | 101.455 | 2 | 11 | | | |
| WrappedDistributedTorchTrainable_bdb17_00004 | RUNNING | | 3.76462 | 0.0484779 | 35.6497 | 4 | 37.4868 | 516.052 | 3 | 11 | | | |
| WrappedDistributedTorchTrainable_bdb17_00005 | RUNNING | | 2.58193 | 0.037068 | 3.15383 | 1 | 924.28 | 1.62407 | 1 | 7 | | | |
| WrappedDistributedTorchTrainable_bdb17_00006 | PENDING | | 6.32752 | 0.0994885 | 3.13535 | 4 | 85.9414 | 14.1138 | 2 | 9 | | | |
| WrappedDistributedTorchTrainable_bdb17_00007 | PENDING | | 131.979 | 0.438311 | 3.27135 | 2 | 2.03457 | 1.63955 | 3 | 11 | | | |
| WrappedDistributedTorchTrainable_bdb17_00008 | PENDING | | 110.19 | 0.00681927 | 246.562 | 1 | 1.557 | 98.9305 | 1 | 9 | | | |
| WrappedDistributedTorchTrainable_bdb17_00009 | PENDING | | 24.706 | 0.00155841 | 335.158 | 1 | 15.5629 | 33.2957 | 3 | 7 | | | |
| WrappedDistributedTorchTrainable_bdb17_00010 | PENDING | | 1.38995 | 0.0813887 | 705.937 | 2 | 2.10977 | 806.93 | 3 | 7 | | | |
| WrappedDistributedTorchTrainable_bdb17_00011 | PENDING | | 2.05429 | 0.0680983 | 114.04 | 4 | 600.272 | 156.632 | 3 | 7 | | | |
+----------------------------------------------+----------+--------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------+
In addition to the above strange warning, I also observed some strange behaviors: the logs inside the trials are repeated 3 times! For example:
(pid=68905) 2020-09-25 16:33:22,752 INFO trainable.py:85 -- Checkpoint size is 97871187 bytes (pid=68905) 2020-09-25 16:33:22,752 INFO trainable.py:85 -- Checkpoint size is 97871187 bytes (pid=68905) 2020-09-25 16:33:22,752 INFO trainable.py:85 -- Checkpoint size is 97871187 bytes ... (pid=68917) Saving checkpoint to /tmp/tmpgy8phxr5/checkpoint.pth (pid=68917) Saving checkpoint to /tmp/tmpgy8phxr5/checkpoint.pth (pid=68917) Saving checkpoint to /tmp/tmpgy8phxr5/checkpoint.pth Result for WrappedDistributedTorchTrainable_bdb17_00006: accuracy: 0.20715384009302748 date: 2020-09-25_16-34-26 done: true experiment_id: e9dfae8331b9413c9370a8dc434e7e92 experiment_tag: 6_alpha=6.3275,beta=0.099489,bilateral_weight=3.1353,blur=4,gamma=85.941,spatial_weight=14.114,steps=2,window=9 hostname: r10i2n1 iterations_since_restore: 2 loss: 5.397783092498779 node_ip: 10.148.3.221 pid: 70209 should_checkpoint: true time_since_restore: 131.63342452049255 time_this_iter_s: 63.13958144187927 time_total_s: 131.63342452049255 timestamp: 1601044466 timesteps_since_restore: 0 training_iteration: 2 trial_id: bdb17_00006
It seems that something, the same thing, is running on all the 3 Ray nodes...
Ah I noticed something really odd - notice that on two different worker nodes, they have the same IP address (10.148.3.221) - is this expected?
Number of Ray workers (excluding the head): 2
******************* Starting Ray worker node on r10i2n2, port_min = 6390, port_max = 6399
2020-09-25 16:04:49,292 INFO scripts.py:542 -- Using IP address 10.148.3.221 for this node.
2020-09-25 16:04:49,298 INFO resource_spec.py:231 -- Starting Ray with 121.04 GiB memory available for workers and up to 51.89 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2020-09-25 16:04:49,323 INFO scripts.py:551 --
Started Ray on this node. If you wish to terminate the processes that have been started, run
ray stop
******************* Starting Ray worker node on r10i2n3, port_min = 6400, port_max = 6409
Number of PyTorch data workers: 4
OMP_NUM_THREADS=4
python -u main_tune.py --cfg configs/resnet50_coco.cfg --num_workers 4 --ngpus_per_trial 2
2020-09-25 16:05:26,359 INFO scripts.py:542 -- Using IP address 10.148.3.221 for this node.
Ah I noticed something really odd - notice that on two different worker nodes, they have the same IP address (10.148.3.221) - is this expected?
I have no idea, but when starting the worker nodes, I set --address $ip_head
, so maybe the worker nodes take the same IP address as the head:
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} stop --force && sleep 60 && ${ray_bin} start --address $ip_head --min-worker-port=${port_min} --max-worker-port=${port_max} --block --num-cpus ${total_cores}
The tutorial does the same thing so I guess that's expected?
Hm, --address $ip_head
tells the Ray "worker node" the location of the head node - this is how the cluster is joined together. However, each worker node should a unique IP address (which is printed in the logs).
@richardliaw You are right. I didn't notice that the IP address of the head was different than the workers. So here the issue is that all the worker nodes have the same IP address. I have no idea how this can happen.
One idea would just stick to using 1 worker node for now. Are you on the Ray slack? Happy to chat more there!
I've just tried again with 4 nodes, same issues:
Starting Ray head node on r13i2n1 2020-09-25 20:40:26,620 INFO scripts.py:465 -- Using IP address 10.159.48.11 for this node. Starting Ray worker node on r13i2n2, port_min = 6390, port_max = 6399 2020-09-25 20:42:27,101 INFO scripts.py:542 -- Using IP address 10.148.7.94 for this node. Starting Ray worker node on r13i2n3, port_min = 6400, port_max = 6409 2020-09-25 20:43:27,136 INFO scripts.py:542 -- Using IP address 10.148.7.94 for this node. Starting Ray worker node on r13i2n4, port_min = 6410, port_max = 6419 2020-09-25 20:44:27,153 INFO scripts.py:542 -- Using IP address 10.148.7.94 for this node.
One idea would just stick to using 1 worker node for now. Are you on the Ray slack? Happy to chat more there!
I'll ping you on Slack. Thanks!
Resolved offline!
Hello,
After a lot of effort, I've managed to get Ray Tune somehow working on a Slurm server for doing distributed hyper parameter search of my PyTorch model. However, I still have some doubts about what I did, and thus I'm not sure if it's working properly.
Each node of my server has 8 GPUs and 24 CPUs (i.e. 3 CPUs per GPU). I want each trial of Tune runs on 8 GPUs (i.e. an entire node), so I did something like this in my Python script:
Consider two cases:
sbatch tune.slurm
where mytune.slurm
script looks like this:This seems to work. I obtained some tuning results, but I'm not sure if the trials properly used the resources the way I want them to (i.e. each trial should use 8 GPUs and 24 CPUs). Usually I can connect to the running node using
srun --jobid=JOBID --pty bash
(and thenhtop
andnvidia-smi
to check the usage). However, when using Ray, this command doesn't work! I can no longer access to the node in interactive mode.The reason I think it might not work properly is that, when starting the Ray server, I assigned only 3 GPUs in total while I told Ray to take 24:
How can Ray have access to the 24 CPUs in this case? As a test, I changed it to 25 (which is larger than the total on the node):
We should expect an error, right? Nope, it still worked (it even showed
'CPU': 25.0
in the logs). Why? At this point, I don't trust the value in--num-cpus
anymore.Naturally, I thought: to make Ray use 24 CPUs, I just need to assign 24 CPUs to the
srun
command:Slurm complained because the maximum of
--cpus-per-task
is 3.Let's increase
ntasks
then:Slurm complained again:EDIT: that error was forsrun: Warning: can't run 1 processes on 8 nodes, setting nnodes to 1
.--nodes=8
. For the above command (srun --nodes=1 --ntasks=8
), it started Ray 8 times! I sawStarting Ray with 543.85 GiB memory available for workers and...
8 times and thenRuntimeError: Couldn't start Redis
andConnectionRefusedError: [Errno 111] Connection refused
. Thus, for starting Ray, it really has to be--ntasks=1
.The question is: how should I start the Ray server correctly? Should the following work?
Thank you very much in advance for your help!