ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

How to properly deploy Ray Tune on a Slurm server for PyTorch hyperparameter search #10986

Closed netw0rkf10w closed 4 years ago

netw0rkf10w commented 4 years ago

Hello,

After a lot of effort, I've managed to get Ray Tune somehow working on a Slurm server for distributed hyperparameter search of my PyTorch model. However, I still have some doubts about what I did, and thus I'm not sure it's working properly.

Each node of my server has 8 GPUs and 24 CPUs (i.e. 3 CPUs per GPU). I want each Tune trial to run on 8 GPUs (i.e. an entire node), so I did something like this in my Python script:

    distributed_train = DistributedTrainableCreator(
        partial(train_function, args=args),
        use_gpu=True,
        num_workers=args.ngpus_per_trial,  # number of GPUs per trial
        num_cpus_per_worker=3,
        backend="nccl",
        timeout_s=60)

    result = tune.run(
        distributed_train,
        resources_per_trial=None,
        config=config,
        num_samples=args.num_trials,
        scheduler=scheduler,
        progress_reporter=reporter)

Consider two cases:

  1. If I only want to use 1 node for the tuning task (i.e. 1 trial at a time), then I do sbatch tune.slurm where my tune.slurm script looks like this:
#!/bin/bash
#SBATCH --ntasks=8                   # total number of GPUs
#SBATCH --ntasks-per-node=8          # number of tasks per node
#SBATCH --gres=gpu:8                 # number of GPUs per node
#SBATCH --cpus-per-task=3            # number of CPUs per task

module purge
module load python/3.7.5 cuda/10.1.2 cudnn/7.6.5.32-cuda-10.1 nccl/2.6.4-1-cuda gcc/7.3.0 openmpi/4.0.2-cuda

# Start the head node
suffix='6379'
ip_head=`hostname`:$suffix
export ip_head # Exporting for later access by trainer.py
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24 &
sleep 5

OMP_NUM_THREADS=3 python -u main_tune.py --cfg configs/resnet50_coco.cfg --ngpus_per_trial 8

This seems to work. I obtained some tuning results, but I'm not sure whether the trials properly used the resources the way I want them to (i.e. each trial should use 8 GPUs and 24 CPUs). Usually I can connect to the running node using srun --jobid=JOBID --pty bash (and then htop and nvidia-smi to check the usage). However, when using Ray, this command doesn't work! I can no longer access the node in interactive mode.
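
One thing I can still check without interactive access is what Ray itself reports. Below is a minimal sketch (not my actual script) that connects a driver to the running cluster, using the address and password printed by ray start, and prints Ray's view of the resources:

import ray

# Connect to the already-running cluster (address/password as printed by `ray start`).
ray.init(address='auto', redis_password='5241590000000000')

# Total resources Ray believes the cluster has, and what is currently free.
print(ray.cluster_resources())
print(ray.available_resources())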

The reason I think it might not work properly is that, when starting the Ray server, I assigned only 3 CPUs in total while telling Ray to take 24:

srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24

How can Ray have access to the 24 CPUs in this case? As a test, I changed it to 25 (which is larger than the total on the node):

srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 25

We should expect an error, right? Nope, it still worked (it even showed 'CPU': 25.0 in the logs). Why? At this point, I don't trust the value of --num-cpus anymore.

Naturally, I thought: to make Ray use 24 CPUs, I just need to assign 24 CPUs to the srun command:

srun --nodes=1 --ntasks=1 --cpus-per-task=24 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24

Slurm complained because the maximum of --cpus-per-task is 3.

Let's increase ntasks then:

srun --nodes=1 --ntasks=8 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24

Slurm complained again: srun: Warning: can't run 1 processes on 8 nodes, setting nnodes to 1. EDIT: that error was for --nodes=8. For the above command (srun --nodes=1 --ntasks=8), it started Ray 8 times! I saw Starting Ray with 543.85 GiB memory available for workers and... 8 times, and then RuntimeError: Couldn't start Redis and ConnectionRefusedError: [Errno 111] Connection refused. Thus, to start Ray, it really has to be --ntasks=1.

  2. If I want to use 2 nodes for the tuning task (i.e. 2 trials at a time), then I request them like this:
#!/bin/bash
#SBATCH --ntasks=16                   # total number of GPUs
#SBATCH --ntasks-per-node=8          # number of tasks per node
#SBATCH --gres=gpu:8                 # number of GPUs per node
#SBATCH --cpus-per-task=3            # number of CPUs per task

The question is: how should I start the Ray server correctly? Should the following work?

# Start the head node
suffix='6379'
ip_head=`hostname`:$suffix
export ip_head # Exporting for later access by trainer.py
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24 &
sleep 5

# Start the worker nodes
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --exclude=`hostname` ~/.local/bin/ray start --address $ip_head --block --num-cpus 24 &
sleep 5

OMP_NUM_THREADS=3 python -u main_tune.py --cfg configs/resnet50_coco.cfg --ngpus_per_trial 8

Thank you very much in advance for your help!

richardliaw commented 4 years ago

cc @sumanthratna could you help take a look?

richardliaw commented 4 years ago

@netw0rkf10w thanks for this long write up! I'll try to answer a couple:

  1. Not sure why interactive mode doesn't work
  2. --num-cpus is only accounting/an override value. It doesn't necessarily correspond with physical CPUs. You should set it to --num-cpus=3 which corresponds with --cpus-per-task=3 (assuming each machine in your cluster only has 3 CPUs). See the short sketch after this list.
  3. srun --nodes=1 --ntasks=8 I'm not quite sure what this does, but perhaps you should increase number of nodes too.
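
Regarding (2), here is a minimal sketch of what "accounting" means: Ray accepts whatever CPU count you declare and uses it purely for scheduling bookkeeping, even if it exceeds the physical core count:

import ray

# Declare more CPUs than the node physically has; Ray does not check this
# against the hardware, it just schedules against the declared number.
ray.init(num_cpus=25)
print(ray.cluster_resources()["CPU"])  # prints 25.0
ray.shutdown()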

Have you looked at https://docs.ray.io/en/master/cluster/slurm.html?

netw0rkf10w commented 4 years ago

@richardliaw Thanks a lot for the prompt reply!

  2. --num-cpus is only accounting/an override value. It doesn't necessarily correspond with physical CPUs. You should set it to --num-cpus=3 which corresponds with --cpus-per-task=3 (assuming each machine in your cluster only has 3 CPUs).

I already tried --num-cpus=3 but then Tune had access to only 3 CPUs, which raised a resource error because for each trial I requested 8 GPUs and 24 CPUs:

distributed_train = DistributedTrainableCreator(
        partial(train_function, args=args),
        use_gpu=True,
        num_workers=8,  # number of GPUs per trial
        num_cpus_per_worker=3,
        backend="nccl",
        timeout_s=60)
  3. srun --nodes=1 --ntasks=8 I'm not quite sure what this does, but perhaps you should increase number of nodes too.

Well, on my Slurm server, I was instructed that --ntasks is the total number of GPUs requested (thus: --ntasks=8 means 1 node and --ntasks=16 means 2 nodes).

Have you looked at https://docs.ray.io/en/master/cluster/slurm.html?

Actually that was the first thing I looked at, and it took me a while to adapt the example to my use case. In that example, each node has 1 task, but in my case --ntasks corresponds to the number of GPUs (i.e. 8 per node). At least this is what I have been doing for PyTorch distributed training (8 GPUs per node, so 8 tasks per node). Let me try packing the 8 GPUs into a single task to see if it works...

netw0rkf10w commented 4 years ago

I tried a single task with 24 CPUs:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=24

...

# Start the worker nodes
srun --nodes=1 --ntasks=1 --cpus-per-task=24 --exclude=`hostname` ~/.local/bin/ray start --address $ip_head --block --num-cpus 24 &
sleep 5

but it didn't work:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

netw0rkf10w commented 4 years ago

A clarification just in case: I managed to get Tune working on a single node, except that I'm not sure whether the trials used all 24 CPUs available on the node or shared the same 3 CPUs among them, because I launched Ray using

srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24

(It's late in Europe, I'll get back in 8-10 hours. Thanks.)

richardliaw commented 4 years ago

@netw0rkf10w can you post the output of the above Tune run with:

srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24

Can you also try:

srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` python -c 'import os; print(os.cpu_count())'
# my hypothesis is that it will return 24.

From what I understand, here is what it seems like:

  1. You should submit a 1 node, 1 task srun with ~/.local/bin/ray start --head
  2. You should NOT submit 1 node, 8 task srun - this is only relevant for something like pytorch DDP, which requires you to run the same script on each GPU. You do not need to do this for Ray.
  3. You should perhaps reduce the CPU resource requirements for your Trial (in python script). It's actually just for accounting purposes and won't affect runtime.
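
Concretely, (3) would look something like the sketch below, which reuses your DistributedTrainableCreator call with a smaller per-worker CPU reservation (the import path is assumed for the Ray version in this thread, and train_function is just a stub):

from functools import partial

from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator


def train_function(config, checkpoint_dir=None, args=None):
    # Stub trainable so the snippet is self-contained.
    tune.report(loss=0.0)


distributed_train = DistributedTrainableCreator(
    partial(train_function, args=None),
    use_gpu=True,
    num_workers=8,          # still one worker per GPU
    num_cpus_per_worker=1,  # reduced from 3; only a scheduling reservation
    backend="nccl",
    timeout_s=60)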

Another followup question:

#!/bin/bash
#SBATCH --ntasks=2                   # total number of tasks
#SBATCH --ntasks-per-node=1          # number of tasks per node
#SBATCH --gres=gpu:8                 # number of GPUs per node
#SBATCH --cpus-per-task=3            # number of CPUs per task <-------------- can you increase this? or is there a system limit?

netw0rkf10w commented 4 years ago

Hi @richardliaw !

@netw0rkf10w can you post the output of the above Tune run with:

srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24

The logs are shown below, where I set 1 GPU for each trial (the logs are for --num-cpus 25, but for --num-cpus 24 they are similar):

2020-09-24 01:41:45,662 INFO scripts.py:465 -- Using IP address 10.148.8.10 for this node. 2020-09-24 01:41:45,663 INFO resource_spec.py:231 -- Starting Ray with 544.19 GiB memory available for workers and up to 186.26 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=). 2020-09-24 01:41:46,343 INFO services.py:1193 -- View the Ray dashboard at 10.148.8.10:8265 2020-09-24 01:41:46,399 INFO scripts.py:495 -- Started Ray on this node. You can add additional nodes to the cluster by calling

ray start --address='10.148.8.10:6379' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

import ray
ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

ray stop

WARNING: Logging before InitGoogleLogging() is written to STDERR I0924 01:41:52.687922 8548 8548 global_state_accessor.cc:25] Redis server address = 10.148.8.10:6379, is test flag = 0 I0924 01:41:52.698506 8548 8548 redis_client.cc:146] RedisClient connected. I0924 01:41:52.706383 8548 8548 redis_gcs_client.cc:89] RedisGcsClient Connected. I0924 01:41:52.707314 8548 8548 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.8.10:34015 I0924 01:41:52.707515 8548 8548 service_based_accessor.cc:92] Reestablishing subscription for job info. I0924 01:41:52.707527 8548 8548 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0924 01:41:52.707535 8548 8548 service_based_accessor.cc:797] Reestablishing subscription for node info. I0924 01:41:52.707541 8548 8548 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0924 01:41:52.707547 8548 8548 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0924 01:41:52.707553 8548 8548 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0924 01:41:52.707561 8548 8548 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. Nodes in the Ray cluster: [{'NodeID': 'd74aff9bad58c110345691f5b9697174cc2372d6', 'Alive': True, 'NodeManagerAddress': '10.148.8.10', 'NodeManagerHostname': 'node1', 'NodeManagerPort': 61334, 'ObjectManagerPort': 38197, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-24_01-41-45_662966_8535/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-24_01-41-45_662966_8535/sockets/raylet', 'alive': True, 'Resources': {'GPUType:V100': 1.0, 'memory': 11145.0, 'CPU': 25.0, 'node:10.148.8.10': 1.0, 'object_store_memory': 2632.0, 'GPU': 8.0}}] == Status == Memory usage on this node: 24.4/754.3 GiB Using AsyncHyperBand: num_stopped=0 Bracket: Iter 2.000: None | Iter 1.000: None Resources requested: 3/25 CPUs, 1/8 GPUs, 0.0/544.19 GiB heap, 0.0/128.52 GiB objects (0/1.0 GPUType:V100) Result logdir: /ray_results/WrappedDistributedTorchTrainable Number of trials: 16 (15 PENDING, 1 RUNNING) +----------------------------------------------+----------+-------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+ | Trial name | status | loc | alpha | beta | bilateral_weight | blur | gamma | spatial_weight | steps | window | |----------------------------------------------+----------+-------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------| | WrappedDistributedTorchTrainable_5b88f_00000 | RUNNING | | 1.00825 | 0.0102361 | 260.833 | 2 | 33.811 | 20.5276 | 2 | 11 | | WrappedDistributedTorchTrainable_5b88f_00001 | PENDING | | 50.2677 | 0.108875 | 282.037 | 4 | 240.289 | 205.205 | 3 | 11 | | WrappedDistributedTorchTrainable_5b88f_00002 | PENDING | | 333.117 | 0.0323349 | 9.59502 | 1 | 198.175 | 600.021 | 3 | 11 | | WrappedDistributedTorchTrainable_5b88f_00003 | PENDING | | 4.48639 | 0.215188 | 14.553 | 1 | 49.9334 | 9.37293 | 2 | 9 | | WrappedDistributedTorchTrainable_5b88f_00004 | PENDING | | 43.2839 | 0.0010359 | 218.981 | 4 | 73.9295 | 1.58913 | 2 | 11 | | WrappedDistributedTorchTrainable_5b88f_00005 | PENDING | | 148.056 | 0.369112 | 14.5907 | 1 | 46.9408 | 2.32398 | 3 | 11 | | WrappedDistributedTorchTrainable_5b88f_00006 | PENDING | | 24.3175 | 0.724971 | 24.9273 | 1 | 45.138 | 1.21104 | 1 | 9 | | WrappedDistributedTorchTrainable_5b88f_00007 | PENDING | | 5.86012 | 0.00181386 | 7.42448 | 1 | 137.955 | 
87.5108 | 1 | 11 | | WrappedDistributedTorchTrainable_5b88f_00008 | PENDING | | 1.20372 | 0.00784292 | 108.812 | 2 | 422.599 | 27.7823 | 3 | 9 | | WrappedDistributedTorchTrainable_5b88f_00009 | PENDING | | 8.98295 | 0.00280868 | 10.5628 | 4 | 16.9054 | 116.381 | 3 | 11 | | WrappedDistributedTorchTrainable_5b88f_00010 | PENDING | | 59.0022 | 0.312436 | 32.4772 | 4 | 5.07771 | 6.54218 | 3 | 9 | | WrappedDistributedTorchTrainable_5b88f_00011 | PENDING | | 50.4253 | 0.00763172 | 6.23578 | 4 | 293.568 | 136.231 | 2 | 11 | | WrappedDistributedTorchTrainable_5b88f_00012 | PENDING | | 35.6073 | 0.186986 | 101.348 | 2 | 2.25512 | 5.87278 | 1 | 11 | | WrappedDistributedTorchTrainable_5b88f_00013 | PENDING | | 763.593 | 0.856251 | 319.073 | 4 | 1.27139 | 2.50655 | 3 | 7 | | WrappedDistributedTorchTrainable_5b88f_00014 | PENDING | | 22.8901 | 0.00167951 | 327.988 | 1 | 399.009 | 57.7956 | 2 | 11 | | WrappedDistributedTorchTrainable_5b88f_00015 | PENDING | | 1.41336 | 0.0258321 | 1.33932 | 1 | 143.696 | 137.824 | 2 | 11 | +----------------------------------------------+----------+-------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+ Saving checkpoint to /ray_results/WrappedDistributedTorchTrainable/WrappedDistributedTorchTrainable_6_alpha=24.318,beta=0.72497,bilateral_weight=24.927,blur=1,gamma=45.138,spatial_weight=1.211,step_2020-09-24_01-41-53nnki37bt/worker_0/checkpoint_0/checkpoint.pth Result for WrappedDistributedTorchTrainable_5b88f_00006: accuracy: 0.2075007382104213 date: 2020-09-24_01-45-42 done: false experiment_id: 1555c7fa0aba4af1a036bcc6d927a2fa experiment_tag: 6_alpha=24.318,beta=0.72497,bilateral_weight=24.927,blur=1,gamma=45.138,spatial_weight=1.211,steps=1,window=9 hostname: jean-zay-ia810 iterations_since_restore: 1 loss: 5.2999533758163455 node_ip: 10.148.8.10 pid: 8676 should_checkpoint: true time_since_restore: 218.52256321907043 time_this_iter_s: 218.52256321907043 time_total_s: 218.52256321907043 timestamp: 1600904742 timesteps_since_restore: 0 training_iteration: 1 trial_id: 5b88f_00006

== Status == Memory usage on this node: 48.9/754.3 GiB Using AsyncHyperBand: num_stopped=0 Bracket: Iter 2.000: None | Iter 1.000: 0.2075007382104213 Resources requested: 24/25 CPUs, 8/8 GPUs, 0.0/544.19 GiB heap, 0.0/128.52 GiB objects (0/1.0 GPUType:V100) Result logdir: /ray_results/WrappedDistributedTorchTrainable Number of trials: 16 (8 PENDING, 8 RUNNING) +----------------------------------------------+----------+------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------+ | Trial name | status | loc | alpha | beta | bilateral_weight | blur | gamma | spatial_weight | steps | window | loss | accuracy | training_iteration | |----------------------------------------------+----------+------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------| | WrappedDistributedTorchTrainable_5b88f_00000 | RUNNING | | 1.00825 | 0.0102361 | 260.833 | 2 | 33.811 | 20.5276 | 2 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00001 | RUNNING | | 50.2677 | 0.108875 | 282.037 | 4 | 240.289 | 205.205 | 3 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00002 | RUNNING | | 333.117 | 0.0323349 | 9.59502 | 1 | 198.175 | 600.021 | 3 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00003 | RUNNING | | 4.48639 | 0.215188 | 14.553 | 1 | 49.9334 | 9.37293 | 2 | 9 | | | | | WrappedDistributedTorchTrainable_5b88f_00004 | RUNNING | | 43.2839 | 0.0010359 | 218.981 | 4 | 73.9295 | 1.58913 | 2 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00005 | RUNNING | | 148.056 | 0.369112 | 14.5907 | 1 | 46.9408 | 2.32398 | 3 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00006 | RUNNING | 10.148.8.10:8676 | 24.3175 | 0.724971 | 24.9273 | 1 | 45.138 | 1.21104 | 1 | 9 | 5.29995 | 0.207501 | 1 | | WrappedDistributedTorchTrainable_5b88f_00007 | RUNNING | | 5.86012 | 0.00181386 | 7.42448 | 1 | 137.955 | 87.5108 | 1 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00008 | PENDING | | 1.20372 | 0.00784292 | 108.812 | 2 | 422.599 | 27.7823 | 3 | 9 | | | | | WrappedDistributedTorchTrainable_5b88f_00009 | PENDING | | 8.98295 | 0.00280868 | 10.5628 | 4 | 16.9054 | 116.381 | 3 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00010 | PENDING | | 59.0022 | 0.312436 | 32.4772 | 4 | 5.07771 | 6.54218 | 3 | 9 | | | | | WrappedDistributedTorchTrainable_5b88f_00011 | PENDING | | 50.4253 | 0.00763172 | 6.23578 | 4 | 293.568 | 136.231 | 2 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00012 | PENDING | | 35.6073 | 0.186986 | 101.348 | 2 | 2.25512 | 5.87278 | 1 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00013 | PENDING | | 763.593 | 0.856251 | 319.073 | 4 | 1.27139 | 2.50655 | 3 | 7 | | | | | WrappedDistributedTorchTrainable_5b88f_00014 | PENDING | | 22.8901 | 0.00167951 | 327.988 | 1 | 399.009 | 57.7956 | 2 | 11 | | | | | WrappedDistributedTorchTrainable_5b88f_00015 | PENDING | | 1.41336 | 0.0258321 | 1.33932 | 1 | 143.696 | 137.824 | 2 | 11 | | | | +----------------------------------------------+----------+------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------+

Can you also try:

srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` python -c 'import os; print(os.cpu_count())'
# my hypothesis is that it will return 24.

I don't know why, but I always have to put the full path to binaries in srun, otherwise I get an error:

echo "Path to python:"
which python
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` python -c 'import os; print(os.cpu_count())'

Path to python: /somepath/anaconda-py3/2019.10/bin/python
slurmstepd: error: execve(): python: No such file or directory

The output of

python_bin=$(which python)
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ${python_bin} -c 'import os; print(os.cpu_count())'

is 48 (I guess there are 24 cores, each with 2 threads). So your hypothesis is true: Python can access all the CPUs.
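
As a side note, the two numbers can be told apart with the small check below (a sketch, not part of my script; whether Slurm actually restricts CPU affinity depends on how the cluster is configured):

import os

# Every hardware thread on the machine (48 here: 24 cores x 2 threads),
# independent of what Slurm allocated to this task.
print("os.cpu_count():", os.cpu_count())

# The CPUs this process is actually allowed to run on (Linux-only call);
# on clusters that pin tasks, this reflects --cpus-per-task.
print("allowed CPUs:  ", len(os.sched_getaffinity(0)))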

From what I understand, here is what it seems like:

  1. You should submit a 1 node, 1 task srun with ~/.local/bin/ray start --head
  2. You should NOT submit 1 node, 8 task srun - this is only relevant for something like pytorch DDP, which requires you to run the same script on each GPU. You do not need to do this for Ray.
  3. You should perhaps reduce the CPU resource requirements for your Trial (in python script). It's actually just for accounting purposes and won't affect runtime.

Regarding (1) and (2): that was my conclusion as well (in case (2), Ray is started 8 times; in other words, there are 8 Ray servers running at the same time).

Regarding (3): Do you mean that setting num_cpus_per_worker in DistributedTrainableCreator doesn't affect the training? That's strange... I thought num_cpus_per_worker corresponds to the number of CPUs assigned to each PyTorch process (my PyTorch data loaders use 3 CPUs to process the data).
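
To illustrate what I mean, here is a hypothetical stand-in for the data-loading part of my training function (not my actual code): each PyTorch worker would spawn 3 DataLoader workers, which is where I expect the 3 CPUs per worker to be used:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset so the snippet runs on its own.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

# num_workers=3 starts 3 data-loading processes inside each trial worker.
loader = DataLoader(dataset, batch_size=64, num_workers=3, pin_memory=True)

for features, labels in loader:
    pass  # forward/backward pass would go here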

Another followup question:

#!/bin/bash
#SBATCH --ntasks=2                   # total number of tasks
#SBATCH --ntasks-per-node=1          # number of tasks per node
#SBATCH --gres=gpu:8                 # number of GPUs per node
#SBATCH --cpus-per-task=3            # number of CPUs per task <-------------- can you increase this? or is there a system limit?

Indeed, 3 is the limit of --cpus-per-task. Using a higher value, I got sbatch: error: Batch job submission failed: Requested node configuration is not available.

In conclusion, it seems that Ray Tune works correctly with the first configuration in my first post.

richardliaw commented 4 years ago

Sounds good. Maybe you now want to try using multiple nodes for your Ray training script?

srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24

# See example for how to do this right
srun --nodes=1 --ntasks=1 --cpus-per-task=3 --nodelist=`hostname` ~/.local/bin/ray start --address  $ip_head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 24

netw0rkf10w commented 4 years ago

@richardliaw Thanks. I adapted the tutorial in the master branch to my use case but haven't been able to get it working yet.

Consider the following scenario:

First step: allocate 3 nodes and load the necessary modules:

#!/bin/bash
#SBATCH --ntasks=12                   # total number of GPUs
#SBATCH --ntasks-per-node=4           # number of tasks per node
#SBATCH --gres=gpu:4                  # number of GPUs per node
#SBATCH --cpus-per-task=10            # number of CPUs per task

module load python/3.7.5 cuda/10.1.2 cudnn/7.6.5.32-cuda-10.1 nccl/2.6.4-1-cuda gcc/7.3.0 openmpi/4.0.2-cuda

Second step: start the Ray head node:

suffix='6379'
ip_head=`hostname`:$suffix
export ip_head # Exporting for later access by trainer.py

# Start Ray head node
ray_bin=$(which ray)
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=`hostname` ${ray_bin} start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 40 &
sleep 5

(Note: I set --cpus-per-task=1 because I thought the Ray server doesn't need a lot of CPUs, so it'd be better to save them for PyTorch, but please correct me if I'm wrong.)

Third step: start the two remaining Ray worker nodes:

# Start Ray worker nodes
srun --nodes=2 --ntasks=2 --ntasks-per-node=1 --cpus-per-task=1 --exclude=`hostname` ${ray_bin} start --address $ip_head --block --num-cpus 40 &
sleep 5

This command starts a Ray server on each of the worker nodes.

Final step: launch the Tune Python script; each trial will use 2 GPUs and 10 CPUs:

OMP_NUM_THREADS=10 python -u main_tune.py --cfg configs/resnet50_coco.cfg --num_workers 10 --ngpus_per_trial 2

Result: it didn't work. The error was

Check failed: assigned_port != -1 Failed to allocate a port for the worker. Please specify a wider port range using the '--min-worker-port' and '--max-worker-port' arguments to 'ray start'.

According to the logs (appended to the end of this post), all 3 Ray servers started correctly on 3 different nodes.

What did I do wrong?

Thank you again for your help!

2020-09-25 00:08:25,724 INFO scripts.py:465 -- Using IP address 10.159.40.14 for this node. 2020-09-25 00:08:25,726 INFO resource_spec.py:231 -- Starting Ray with 112.21 GiB memory available for workers and up to 52.1 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=). 2020-09-25 00:08:27,609 INFO services.py:1193 -- View the Ray dashboard at 10.159.40.14:8265 2020-09-25 00:08:27,852 INFO scripts.py:495 -- Started Ray on this node. You can add additional nodes to the cluster by calling

ray start --address='10.159.40.14:6379' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

import ray
ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

ray stop

Number of Ray workers (exclusing the head): 2 2020-09-25 00:08:29,098 INFO scripts.py:542 -- Using IP address 10.148.8.149 for this node. 2020-09-25 00:08:29,101 INFO resource_spec.py:231 -- Starting Ray with 121.53 GiB memory available for workers and up to 52.1 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=). 2020-09-25 00:08:29,139 INFO scripts.py:542 -- Using IP address 10.148.8.148 for this node. 2020-09-25 00:08:29,163 INFO scripts.py:551 -- Started Ray on this node. If you wish to terminate the processes that have been started, run

ray stop

2020-09-25 00:08:29,142 INFO resource_spec.py:231 -- Starting Ray with 121.53 GiB memory available for workers and up to 52.1 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=). 2020-09-25 00:08:29,196 INFO scripts.py:551 -- Started Ray on this node. If you wish to terminate the processes that have been started, run

ray stop

WARNING: Logging before InitGoogleLogging() is written to STDERR I0925 00:09:06.578213 31465 31465 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 00:09:06.579907 31465 31465 redis_client.cc:146] RedisClient connected. I0925 00:09:06.588017 31465 31465 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 00:09:06.589426 31465 31465 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:41451 I0925 00:09:06.589785 31465 31465 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 00:09:06.589805 31465 31465 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 00:09:06.589818 31465 31465 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 00:09:06.589831 31465 31465 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 00:09:06.589841 31465 31465 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 00:09:06.589854 31465 31465 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 00:09:06.589869 31465 31465 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. Nodes in the Ray cluster: [{'NodeID': '78d43a5c8a3edb926e35128f9b0bb68e4217ccb1', 'Alive': True, 'NodeManagerAddress': '10.159.40.14', 'NodeManagerHostname': 'r11i4n5', 'NodeManagerPort': 51609, 'ObjectManagerPort': 37797, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/raylet', 'alive': True, 'Resources': {'memory': 2298.0, 'node:10.159.40.14': 1.0, 'CPU': 120.0, 'object_store_memory': 736.0, 'GPU': 4.0, 'GPUType:V100': 1.0}}, {'NodeID': 'ace3bd94d72a8e30eda055c6097d364fb5d5be65', 'Alive': True, 'NodeManagerAddress': '10.148.8.149', 'NodeManagerHostname': 'r11i4n7', 'NodeManagerPort': 54065, 'ObjectManagerPort': 44949, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/raylet', 'alive': True, 'Resources': {'node:10.148.8.149': 1.0, 'object_store_memory': 736.0, 'GPU': 4.0, 'GPUType:V100': 1.0, 'memory': 2489.0, 'CPU': 120.0}}, {'NodeID': '617df6676ae7618caea3db9db555a50df94cf865', 'Alive': True, 'NodeManagerAddress': '10.148.8.148', 'NodeManagerHostname': 'r11i4n6', 'NodeManagerPort': 43854, 'ObjectManagerPort': 39145, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_00-08-25_725507_31308/sockets/raylet', 'alive': True, 'Resources': {'object_store_memory': 736.0, 'GPU': 4.0, 'GPUType:V100': 1.0, 'memory': 2489.0, 'CPU': 120.0, 'node:10.148.8.148': 1.0}}] (pid=32152, ip=10.159.40.64) F0925 00:09:06.604905 32152 32152 core_worker.cc:317] Check failed: assigned_port != -1 Failed to allocate a port for the worker. Please specify a wider port range using the '--min-worker-port' and '--max-worker-port' arguments to 'ray start'. (pid=32152, ip=10.159.40.64) Check failure stack trace: (pid=32152, ip=10.159.40.64) @ 0x14c01a1a9edd google::LogMessage::Fail() ...

netw0rkf10w commented 4 years ago

In a previous version of the tutorial (the one for the latest pip release), the worker nodes were started using a for loop. Maybe I can use that and start each node with a different port...

netw0rkf10w commented 4 years ago

Some new strange behaviors...

I changed my script to start the worker nodes one by one using a for loop:

hostname=`hostname`
suffix='6379'
ip_head=${hostname}:$suffix
export ip_head # Exporting for later access by trainer.py

ray_bin=$(which ray)

# Start Ray head node
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${hostname} ${ray_bin} start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus 40 &
sleep 5

# Start Ray worker nodes: those different from the host
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
for ((  i=0; i<${SLURM_JOB_NUM_NODES}; i++ ))
do
  node=${nodes_array[$i]}
  if [[ "${node}" != "${hostname}" ]] ;then
    srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} start --address $ip_head --block --num-cpus 40 &
    sleep 5
  fi
done

It seemed to work:

I0925 01:33:36.487109 35744 35744 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:33:36.489140 35744 35744 redis_client.cc:146] RedisClient connected. I0925 01:33:36.497191 35744 35744 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:33:36.498423 35744 35744 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:40417 I0925 01:33:36.498684 35744 35744 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:33:36.498700 35744 35744 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:33:36.498710 35744 35744 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:33:36.498719 35744 35744 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:33:36.498728 35744 35744 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:33:36.498736 35744 35744 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:33:36.498746 35744 35744 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. Nodes in the Ray cluster: [{'NodeID': 'decb0e6e7c8d949a1b543e284677743cd96a8d59', 'Alive': True, 'NodeManagerAddress': '10.148.8.148', 'NodeManagerHostname': 'r11i4n6', 'NodeManagerPort': 64977, 'ObjectManagerPort': 37009, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/raylet', 'alive': True, 'Resources': {'GPU': 4.0, 'GPUType:V100': 1.0, 'memory': 2489.0, 'CPU': 120.0, 'node:10.148.8.148': 1.0, 'object_store_memory': 736.0}}, {'NodeID': '9c663cff56c1085c6cb8768cc940afdd9a51f8a9', 'Alive': True, 'NodeManagerAddress': '10.159.40.14', 'NodeManagerHostname': 'r11i4n5', 'NodeManagerPort': 59819, 'ObjectManagerPort': 46265, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/raylet', 'alive': True, 'Resources': {'node:10.159.40.14': 1.0, 'CPU': 120.0, 'object_store_memory': 735.0, 'GPU': 4.0, 'GPUType:V100': 1.0, 'memory': 2298.0}}, {'NodeID': '11e5439845c7f9736bc71fdf7101eaf2b1a4c9e3', 'Alive': True, 'NodeManagerAddress': '10.148.8.149', 'NodeManagerHostname': 'r11i4n7', 'NodeManagerPort': 54733, 'ObjectManagerPort': 33525, 'ObjectStoreSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-09-25_01-32-20_342200_35514/sockets/raylet', 'alive': True, 'Resources': {'memory': 2489.0, 'CPU': 120.0, 'node:10.148.8.149': 1.0, 'object_store_memory': 736.0, 'GPU': 4.0, 'GPUType:V100': 1.0}}]

But then I had a typo in the Python code so it stopped. I fixed the typo, then launched it again, but it didn't work this time:

I0925 01:39:04.178894 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:04.180613 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:04.188684 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:04.189755 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:04.190002 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:04.190018 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:04.190027 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:04.190037 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:04.190044 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:04.190053 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:04.190063 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:04,191 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:05.191946 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:05.193334 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:05.193362 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:05.219499 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:05.219703 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:05.219717 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:05.219723 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:05.219730 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:05.219738 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:05.219744 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:05.219753 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:05,227 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:06.229421 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:06.231026 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:06.231053 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:06.231572 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:06.231770 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:06.231782 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:06.231791 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:06.231796 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. 
I0925 01:39:06.231803 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:06.231811 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:06.231820 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:06,232 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:07.234253 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:07.235419 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:07.235446 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:07.235951 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:07.236145 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:07.236157 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:07.236165 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:07.236171 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:07.236178 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:07.236186 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:07.236194 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:07,237 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:08.238590 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:08.242529 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:08.242556 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 01:39:08.243067 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:08.243264 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:08.243468 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:08.243477 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:08.243484 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:08.243491 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:08.243499 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:08.243507 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 01:39:08,244 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 01:39:09.245899 37452 37452 global_state_accessor.cc:25] Redis server address = 10.148.6.144:6379, is test flag = 0 I0925 01:39:09.257231 37452 37452 redis_client.cc:146] RedisClient connected. I0925 01:39:09.257258 37452 37452 redis_gcs_client.cc:89] RedisGcsClient Connected. 
I0925 01:39:09.257776 37452 37452 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.6.144:45047 I0925 01:39:09.257969 37452 37452 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 01:39:09.257982 37452 37452 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 01:39:09.257989 37452 37452 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 01:39:09.257995 37452 37452 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 01:39:09.258002 37452 37452 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 01:39:09.258009 37452 37452 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 01:39:09.258019 37452 37452 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. Traceback (most recent call last): File "main_tune.py", line 316, in main(args) File "main_tune.py", line 267, in main ray.init(address=os.environ["ip_head"]) File "/home/user/.local/lib/python3.7/site-packages/ray/worker.py", line 798, in init connect_only=True) File "/home/user/.local/lib/python3.7/site-packages/ray/node.py", line 155, in init redis_password=self.redis_password) File "/home/user/.local/lib/python3.7/site-packages/ray/services.py", line 212, in get_address_info_from_redis redis_address, node_ip_address, redis_password=redis_password) File "/home/user/.local/lib/python3.7/site-packages/ray/services.py", line 183, in get_address_info_from_redis_helper "Redis has started but no raylets have registered yet.") RuntimeError: Redis has started but no raylets have registered yet.

Arrrrrrr...

richardliaw commented 4 years ago

Hmmm, maybe try doing ray stop && before ray start to make sure there are no zombie processes from previous runs?


for ((  i=0; i<${SLURM_JOB_NUM_NODES}; i++ ))
do
  node=${nodes_array[$i]}
  if [[ "${node}" != "${hostname}" ]] ;then
    srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} stop && ${ray_bin} start --address $ip_head --block --num-cpus 40 &
    sleep 5
  fi
done

netw0rkf10w commented 4 years ago

@richardliaw Even worse :((

2020-09-25 02:54:50,475 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? 2020-09-25 02:54:51,115 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 02:54:51,116 ERROR scripts.py:563 -- raylet died with exit code -6 2020-09-25 02:54:51,116 ERROR scripts.py:565 -- Killing remaining processes and exiting... 2020-09-25 02:54:51,458 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 02:54:51,459 ERROR scripts.py:563 -- log_monitor died with exit code 1 2020-09-25 02:54:51,459 ERROR scripts.py:565 -- Killing remaining processes and exiting... I0925 02:54:51.477274 38565 38565 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 W0925 02:54:51.477411 38565 38565 redis_context.cc:305] Failed to connect to Redis, retrying. W0925 02:54:51.577510 38565 38565 redis_context.cc:305] Failed to connect to Redis, retrying.

richardliaw commented 4 years ago

~Is it possible that multiple worker tasks are being placed on the same node?~

Maybe also try ray stop --force with that command. I think this is really close though.

netw0rkf10w commented 4 years ago

@richardliaw Still the same Ray processes died unexpectedly error, with the following script:

let "total_cores=${SLURM_NTASKS} * ${SLURM_CPUS_PER_TASK}"
echo "Total cores = ${total_cores}"

hostname=`hostname`
suffix='6379'
ip_head=${hostname}:$suffix
export ip_head # Exporting for later access by trainer.py

ray_bin=$(which ray)

# Start Ray head node
echo "******************* Starting Ray head node on ${hostname}"
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${hostname} ${ray_bin} start --head --block --dashboard-host 0.0.0.0 --port=6379 --num-cpus ${total_cores} &
sleep 5

# Start Ray worker nodes
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )

for ((  i=0; i<${SLURM_JOB_NUM_NODES}; i++ ))
do
  node=${nodes_array[$i]}
  if [[ "${node}" != "${hostname}" ]] ;then
    port_min=$((6380 + i*10))
    port_max=$((6380 + i*10 + 9))
    echo "******************* Starting Ray worker node on ${node}, port_min = ${port_min}, port_max = ${port_max}"
    srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} stop --force && ${ray_bin} start --address $ip_head --min-worker-port=${port_min} --max-worker-port=${port_max} --block --num-cpus ${total_cores} &
    sleep 5
  fi
done

The full logs are appended to the end of this message. Here are some highlights:

Starting Ray head node on r10i2n1
Starting Ray worker node on r10i2n2, port_min = 6390, port_max = 6399
*** Starting Ray worker node on r10i2n3, port_min = 6400, port_max = 6409
2020-09-25 16:06:02,880 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?

Full logs:

*** Starting Ray head node on r10i2n1 2020-09-25 16:04:43,819 INFO scripts.py:465 -- Using IP address 10.159.36.15 for this node. 2020-09-25 16:04:43,821 INFO resource_spec.py:231 -- Starting Ray with 112.11 GiB memory available for workers and up to 52.04 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=). 2020-09-25 16:04:44,277 INFO services.py:1193 -- View the Ray dashboard at 10.159.36.15:8265 2020-09-25 16:04:44,476 INFO scripts.py:495 -- Started Ray on this node. You can add additional nodes to the cluster by calling

ray start --address='10.159.36.15:6379' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

import ray
ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

ray stop

Number of Ray workers (excluding the head): 2 *** Starting Ray worker node on r10i2n2, port_min = 6390, port_max = 6399 2020-09-25 16:04:49,292 INFO scripts.py:542 -- Using IP address 10.148.3.221 for this node. 2020-09-25 16:04:49,298 INFO resource_spec.py:231 -- Starting Ray with 121.04 GiB memory available for workers and up to 51.89 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=). 2020-09-25 16:04:49,323 INFO scripts.py:551 -- Started Ray on this node. If you wish to terminate the processes that have been started, run

ray stop

*** Starting Ray worker node on r10i2n3, port_min = 6400, port_max = 6409 Number of PyTorch data workers: 4

  • OMP_NUM_THREADS=4
  • python -u main_tune.py --cfg configs/resnet50_coco.cfg --num_workers 4 --ngpus_per_trial 2 2020-09-25 16:05:26,359 INFO scripts.py:542 -- Using IP address 10.148.3.221 for this node. 2020-09-25 16:05:26,374 INFO resource_spec.py:231 -- Starting Ray with 119.09 GiB memory available for workers and up to 51.05 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=). 2020-09-25 16:05:26,403 INFO scripts.py:551 -- Started Ray on this node. If you wish to terminate the processes that have been started, run

    ray stop WARNING: Logging before InitGoogleLogging() is written to STDERR I0925 16:06:02.865236 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:02.869038 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:02.877063 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 16:06:02.878094 53397 53397 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.3.221:42283 I0925 16:06:02.878341 53397 53397 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 16:06:02.878356 53397 53397 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 16:06:02.878367 53397 53397 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 16:06:02.878376 53397 53397 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 16:06:02.878383 53397 53397 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 16:06:02.878392 53397 53397 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 16:06:02.878402 53397 53397 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 16:06:02,880 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 16:06:03.882337 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:03.891283 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:03.891319 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 16:06:03.892272 53397 53397 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.3.221:42283 I0925 16:06:03.892483 53397 53397 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 16:06:03.892496 53397 53397 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 16:06:03.892503 53397 53397 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 16:06:03.892510 53397 53397 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 16:06:03.892519 53397 53397 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 16:06:03.892526 53397 53397 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 16:06:03.892535 53397 53397 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 16:06:03,894 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 16:06:04.901077 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:04.902886 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:04.902923 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 16:06:04.903548 53397 53397 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.3.221:42283 I0925 16:06:04.903774 53397 53397 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 16:06:04.903789 53397 53397 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 16:06:04.903797 53397 53397 service_based_accessor.cc:797] Reestablishing subscription for node info. 
I0925 16:06:04.903805 53397 53397 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 16:06:04.903812 53397 53397 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 16:06:04.903820 53397 53397 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 16:06:04.903829 53397 53397 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 16:06:04,905 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? I0925 16:06:05.906332 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:05.907965 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:05.908004 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. I0925 16:06:05.909677 53397 53397 service_based_gcs_client.cc:193] Reconnected to GCS server: 10.148.3.221:42283 I0925 16:06:05.909907 53397 53397 service_based_accessor.cc:92] Reestablishing subscription for job info. I0925 16:06:05.909921 53397 53397 service_based_accessor.cc:422] Reestablishing subscription for actor info. I0925 16:06:05.909929 53397 53397 service_based_accessor.cc:797] Reestablishing subscription for node info. I0925 16:06:05.909935 53397 53397 service_based_accessor.cc:1073] Reestablishing subscription for task info. I0925 16:06:05.909942 53397 53397 service_based_accessor.cc:1248] Reestablishing subscription for object locations. I0925 16:06:05.909950 53397 53397 service_based_accessor.cc:1368] Reestablishing subscription for worker failures. I0925 16:06:05.909960 53397 53397 service_based_gcs_client.cc:86] ServiceBasedGcsClient Connected. 2020-09-25 16:06:05,911 WARNING services.py:219 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node? 2020-09-25 16:06:06,875 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 16:06:06,875 ERROR scripts.py:563 -- raylet died with exit code -6 2020-09-25 16:06:06,875 ERROR scripts.py:565 -- Killing remaining processes and exiting... I0925 16:06:06.913344 53397 53397 global_state_accessor.cc:25] Redis server address = 10.148.3.221:6379, is test flag = 0 I0925 16:06:06.913882 53397 53397 redis_client.cc:146] RedisClient connected. I0925 16:06:06.913913 53397 53397 redis_gcs_client.cc:89] RedisGcsClient Connected. 2020-09-25 16:06:07,400 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 16:06:07,401 ERROR scripts.py:563 -- log_monitor died with exit code 1 2020-09-25 16:06:07,401 ERROR scripts.py:565 -- Killing remaining processes and exiting... 2020-09-25 16:06:07,444 ERROR scripts.py:560 -- Ray processes died unexpectedly: 2020-09-25 16:06:07,445 ERROR scripts.py:563 -- log_monitor died with exit code 1 2020-09-25 16:06:07,445 ERROR scripts.py:565 -- Killing remaining processes and exiting... srun: error: r10i2n1: task 0: Exited with exit code 1 srun: Terminating job step 377044.0

netw0rkf10w commented 4 years ago

I added a sleep 60 every time I do ray start or ray stop:

for ((  i=0; i<${SLURM_JOB_NUM_NODES}; i++ ))
do
  node=${nodes_array[$i]}
  if [[ "${node}" != "${hostname}" ]] ;then
    port_min=$((6380 + i*10))
    port_max=$((6380 + i*10 + 9))
    echo "******************* Starting Ray worker node on ${node}, port_min = ${port_min}, port_max = ${port_max}"
    srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} stop --force && sleep 60 && ${ray_bin} start --address $ip_head --min-worker-port=${port_min} --max-worker-port=${port_max} --block --num-cpus ${total_cores} &
    sleep 60
  fi
done
sleep 60

and it seems to work!!

I obtained a strange warning though:

2020-09-25 16:31:04,448 WARNING worker.py:1134 -- The actor or task with ID ffffffffffffffff55c3b2b60100 is pending and cannot currently be scheduled. It requires {} for execution and {} for placement, but this node only has remaining {node:10.159.36.15: 1.000000}, {GPUType:V100: 1.000000}, {CPU: 120.000000}, {memory: 112.011719 GiB}, {GPU: 4.000000}, {object_store_memory: 35.839844 GiB}. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

At some point I can see that there are 6 trials running at the same time, so it seems to work:

Resources requested: 48/360 CPUs, 12/12 GPUs, 0.0/352.98 GiB heap, 0.0/107.03 GiB objects (0/3.0 GPUType:V100)
Result logdir: /home/user/ray_results/resnet50_coco/WrappedDistributedTorchTrainable
Number of trials: 12 (6 PENDING, 6 RUNNING)
+----------------------------------------------+----------+--------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------+
| Trial name                                   | status   | loc                |     alpha |       beta |   bilateral_weight |   blur |     gamma |   spatial_weight |   steps |   window |    loss |   accuracy |   training_iteration |
|----------------------------------------------+----------+--------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------|
| WrappedDistributedTorchTrainable_bdb17_00000 | RUNNING  |                    | 600.387   | 0.494662   |          258.554   |      1 | 237.326   |         20.9554  |       3 |        9 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00001 | RUNNING  | 10.148.3.221:68850 |  12.9375  | 0.00255001 |          248.646   |      4 |   3.42105 |        178.76    |       3 |        7 | 5.42031 |   0.205922 |                    1 |
| WrappedDistributedTorchTrainable_bdb17_00002 | RUNNING  |                    |   1.99624 | 0.846555   |            4.908   |      2 |   1.33116 |        907.723   |       3 |        7 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00003 | RUNNING  |                    |  76.9     | 0.26823    |            9.33501 |      2 |   5.17456 |        101.455   |       2 |       11 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00004 | RUNNING  |                    |   3.76462 | 0.0484779  |           35.6497  |      4 |  37.4868  |        516.052   |       3 |       11 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00005 | RUNNING  |                    |   2.58193 | 0.037068   |            3.15383 |      1 | 924.28    |          1.62407 |       1 |        7 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00006 | PENDING  |                    |   6.32752 | 0.0994885  |            3.13535 |      4 |  85.9414  |         14.1138  |       2 |        9 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00007 | PENDING  |                    | 131.979   | 0.438311   |            3.27135 |      2 |   2.03457 |          1.63955 |       3 |       11 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00008 | PENDING  |                    | 110.19    | 0.00681927 |          246.562   |      1 |   1.557   |         98.9305  |       1 |        9 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00009 | PENDING  |                    |  24.706   | 0.00155841 |          335.158   |      1 |  15.5629  |         33.2957  |       3 |        7 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00010 | PENDING  |                    |   1.38995 | 0.0813887  |          705.937   |      2 |   2.10977 |        806.93    |       3 |        7 |         |            |                      |
| WrappedDistributedTorchTrainable_bdb17_00011 | PENDING  |                    |   2.05429 | 0.0680983  |          114.04    |      4 | 600.272   |        156.632   |       3 |        7 |         |            |                      |
+----------------------------------------------+----------+--------------------+-----------+------------+--------------------+--------+-----------+------------------+---------+----------+---------+------------+----------------------+
netw0rkf10w commented 4 years ago

In addition to the strange warning above, I also observed another strange behavior: the logs inside the trials are repeated 3 times! For example:

(pid=68905) 2020-09-25 16:33:22,752 INFO trainable.py:85 -- Checkpoint size is 97871187 bytes
(pid=68905) 2020-09-25 16:33:22,752 INFO trainable.py:85 -- Checkpoint size is 97871187 bytes
(pid=68905) 2020-09-25 16:33:22,752 INFO trainable.py:85 -- Checkpoint size is 97871187 bytes
...
(pid=68917) Saving checkpoint to /tmp/tmpgy8phxr5/checkpoint.pth
(pid=68917) Saving checkpoint to /tmp/tmpgy8phxr5/checkpoint.pth
(pid=68917) Saving checkpoint to /tmp/tmpgy8phxr5/checkpoint.pth
Result for WrappedDistributedTorchTrainable_bdb17_00006:
  accuracy: 0.20715384009302748
  date: 2020-09-25_16-34-26
  done: true
  experiment_id: e9dfae8331b9413c9370a8dc434e7e92
  experiment_tag: 6_alpha=6.3275,beta=0.099489,bilateral_weight=3.1353,blur=4,gamma=85.941,spatial_weight=14.114,steps=2,window=9
  hostname: r10i2n1
  iterations_since_restore: 2
  loss: 5.397783092498779
  node_ip: 10.148.3.221
  pid: 70209
  should_checkpoint: true
  time_since_restore: 131.63342452049255
  time_this_iter_s: 63.13958144187927
  time_total_s: 131.63342452049255
  timestamp: 1601044466
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: bdb17_00006

It seems that the same thing is running on all 3 Ray nodes...
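
A quick way to check where Ray actually places work (again just a sketch of mine, not something verified in this thread) is to launch a handful of remote tasks and collect the hostname and IP of the node each one ran on:

# Hypothetical placement check: if every task reports the same (hostname, ip)
# pair, the jobs are effectively all running on a single node.
import socket

import ray

ray.init(address="auto")

@ray.remote
def where_am_i():
    host = socket.gethostname()
    return host, socket.gethostbyname(host)

locations = set(ray.get([where_am_i.remote() for _ in range(20)]))
print("Distinct (hostname, ip) pairs:", locations)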

richardliaw commented 4 years ago

Ah, I noticed something really odd: two different worker nodes report the same IP address (10.148.3.221). Is this expected?

Number of Ray workers (excluding the head): 2
******************* Starting Ray worker node on r10i2n2, port_min = 6390, port_max = 6399
2020-09-25 16:04:49,292 INFO scripts.py:542 -- Using IP address 10.148.3.221 for this node.
2020-09-25 16:04:49,298 INFO resource_spec.py:231 -- Starting Ray with 121.04 GiB memory available for workers and up to 51.89 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2020-09-25 16:04:49,323 INFO scripts.py:551 --
Started Ray on this node. If you wish to terminate the processes that have been started, run

ray stop
******************* Starting Ray worker node on r10i2n3, port_min = 6400, port_max = 6409
Number of PyTorch data workers: 4

OMP_NUM_THREADS=4

python -u main_tune.py --cfg configs/resnet50_coco.cfg --num_workers 4 --ngpus_per_trial 2
2020-09-25 16:05:26,359 INFO scripts.py:542 -- Using IP address 10.148.3.221 for this node.
netw0rkf10w commented 4 years ago

Ah, I noticed something really odd: two different worker nodes report the same IP address (10.148.3.221). Is this expected?

I have no idea, but when starting the worker nodes, I set --address $ip_head, so maybe the worker nodes take the same IP address as the head:

srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} ${ray_bin} stop --force && sleep 60 && ${ray_bin} start --address $ip_head --min-worker-port=${port_min} --max-worker-port=${port_max} --block --num-cpus ${total_cores}

The tutorial does the same thing so I guess that's expected?

richardliaw commented 4 years ago

Hm, --address $ip_head tells the Ray "worker node" the location of the head node; this is how the cluster is joined together. However, each worker node should still have a unique IP address of its own (which is printed in the logs).
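
As a quick check (just a sketch), ray.nodes() on the head should list one entry per machine, each with its own NodeManagerAddress:

# Sketch: list the nodes the cluster knows about and the IP each one reported.
import ray

ray.init(address="auto")
for node in ray.nodes():
    print(node["NodeManagerAddress"], "alive" if node["Alive"] else "dead", node["Resources"])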

netw0rkf10w commented 4 years ago

@richardliaw You are right. I hadn't noticed that the IP address of the head was different from that of the workers. So the issue here is that all the worker nodes report the same IP address. I have no idea how this can happen.
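
One thing I might try (an untested sketch of mine, not something confirmed in this thread) is to resolve each node's IP inside its own srun step and pass it to ray start explicitly via --node-ip-address, so Ray does not have to guess:

# Hypothetical variant of the worker-start line; assumes `hostname --ip-address`
# resolves to the interface the other nodes can reach.
node_ip=$(srun --nodes=1 --ntasks=1 --nodelist=${node} hostname --ip-address)
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --nodelist=${node} \
    ${ray_bin} start --address $ip_head --node-ip-address=${node_ip} \
    --min-worker-port=${port_min} --max-worker-port=${port_max} \
    --block --num-cpus ${total_cores} &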

richardliaw commented 4 years ago

One idea would be to just stick to using 1 worker node for now. Are you on the Ray Slack? Happy to chat more there!

netw0rkf10w commented 4 years ago

I've just tried again with 4 nodes, same issues:

Starting Ray head node on r13i2n1
2020-09-25 20:40:26,620 INFO scripts.py:465 -- Using IP address 10.159.48.11 for this node.
Starting Ray worker node on r13i2n2, port_min = 6390, port_max = 6399
2020-09-25 20:42:27,101 INFO scripts.py:542 -- Using IP address 10.148.7.94 for this node.
Starting Ray worker node on r13i2n3, port_min = 6400, port_max = 6409
2020-09-25 20:43:27,136 INFO scripts.py:542 -- Using IP address 10.148.7.94 for this node.
Starting Ray worker node on r13i2n4, port_min = 6410, port_max = 6419
2020-09-25 20:44:27,153 INFO scripts.py:542 -- Using IP address 10.148.7.94 for this node.

One idea would be to just stick to using 1 worker node for now. Are you on the Ray Slack? Happy to chat more there!

I'll ping you on Slack. Thanks!

richardliaw commented 4 years ago

Resolved offline!