Worker nodes don't start for ray-lightning & aws

I'd like to run PyTorch on AWS multinode instances. I found this article explaining how to enable multinode training on AWS with Ray-Lightning, and I've been trying to reproduce it.

But I couldn't run the code on multiple nodes on AWS. It successfully runs on a single node (num_workers=1 in the python code below), but failed for num_workers=2. It got stuck after printing HPU available: False, using: 0 HPUs (see below for details).

And it seems when I checked aws dashboard, no worker nodes had started besides head node (i.e. just one instance was running).

Could you please give my any advice? Thanks!

import pytorch_lightning as pl
from model import LightningMNISTClassifier

import ray
from ray_lightning import RayStrategy

num_workers = 2
num_cpus_per_worker = 1
use_gpu = True

def main():
    ray.init("auto")
    # ray.init()

    config = {'lr': 0.001, 'batch_size': 32}
    model = LightningMNISTClassifier(config, data_dir="./data")

    trainer = pl.Trainer(
        max_epochs=1,
        strategy=RayStrategy(
            num_workers=num_workers,
            num_cpus_per_worker=num_cpus_per_worker,
            use_gpu=use_gpu,
        ),
    )
    trainer.fit(model)

if __name__ == '__main__':
    main()

# For details, see https://docs.ray.io/en/master/cluster/vms/references/ray-cluster-configuration.html

# A unique identifier for the head node and workers of this cluster.
cluster_name: mnist_gpu

# The maximum number of workers nodes to launch in addition to the head node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# Cloud-provider specific configuration.
provider:
  type: aws
  region: us-east-2
  availability_zone: us-east-2a, us-east-2b
  cache_stopped_nodes: False

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
      node_config:
        InstanceType: g3s.xlarge
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        node_config:
          InstanceType: g3s.xlarge

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
    '/home/ubuntu': '/Users/toru34/Dropbox/idein/pytorch-multinode-training/ray/mnist',
}

# A list of commands to run to set up nodes. These commands will always run on the head and worker nodes and will be merged with head setup commands for head and with worker setup commands for workers.
setup_commands:
  - pip install --ignore-installed PyYAML==6.0
  - pip install -r requirements.txt

# A list of commands to run to set up the head node. These commands will be merged with the general setup commands.
head_setup_commands: []

# A list of commands to run to set up the worker nodes. These commands will be merged with the general setup commands.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

...
  [7/7] Starting the Ray runtime
Stopped all 7 Ray processes.
Shared connection to 18.191.254.106 closed.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.11.87

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='172.31.11.87:6379'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto')

  To connect to this Ray runtime from outside of the cluster, for example to
  connect to a remote cluster from your laptop directly, use the following
  Python code:
    import ray
    ray.init(address='ray://<head_node_ip_address>:10001')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop
Shared connection to 18.191.254.106 closed.
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec /Users/toru34/Dropbox/idein/pytorch-multinode-training/ray/mnist/config_gpu.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach /Users/toru34/Dropbox/idein/pytorch-multinode-training/ray/mnist/config_gpu.yaml
  Get a remote shell to the cluster manually:
    ssh -o IdentitiesOnly=yes -i /Users/toru34/.ssh/ray-autoscaler_us-east-2.pem ubuntu@18.191.254.106
2022-09-12 05:48:15,766 INFO util.py:335 -- setting max workers for head node type to 0
Fetched IP: 18.191.254.106
2022-09-12 05:48:17,017 INFO util.py:335 -- setting max workers for head node type to 0
Fetched IP: 18.191.254.106
2022-09-11 20:48:20,077 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.11.87:6379...
2022-09-11 20:48:20,085 INFO worker.py:1518 -- Connected to Ray cluster.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
^C^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 1580, in shutdown
    time.sleep(0.5)
KeyboardInterrupt
^C^C^C^C^C
Shared connection to 18.191.254.106 closed.
Error: Command failed:

  ssh -tt -i /Users/toru34/.ssh/ray-autoscaler_us-east-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_0e7c5ecd7a/3ec1da77c0/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@18.191.254.106 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/train.py)'

ray-project / ray_lightning

Worker nodes don't start for ray-lightning & aws #210