ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] `create_colocated` fails with "Unable to create enough colocated actors" #24071

Open melonipoika opened 2 years ago

melonipoika commented 2 years ago

What happened + What you expected to happen

When running on a GCP cluster with 21 machines (1 learner, 20 rollout generators) and using 5 aggregation workers (or more), IMPALA tends to error out with:

File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/utils/actors.py", line 106, in create_colocated

    raise Exception("Unable to create enough colocated actors, abort.")

Exception: Unable to create enough colocated actors, abort.

Full traceback:

ray::CustomImpalaTrainer.__init__() (pid=34990, ip=10.128.0.18)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 729, in __init__
    sync_function_tpl)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 122, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 845, in setup
    **self._kwargs_for_execution_plan())
  File "/tmp/ray/session_2022-03-31_14-10-28_247346_1994/runtime_resources/py_modules_files/_ray_pkg_a66bee062b2f3dd6/customImpala/customImpala.py", line 1000, in execution_plan
    workers, config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/execution/tree_agg.py", line 97, in gather_experiences_tree_aggregation
    create_colocated(Aggregator, [config, g], 1)[0] for g in rollout_groups
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/execution/tree_agg.py", line 97, in <listcomp>
    create_colocated(Aggregator, [config, g], 1)[0] for g in rollout_groups
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/utils/actors.py", line 106, in create_colocated
    raise Exception("Unable to create enough colocated actors, abort.")
Exception: Unable to create enough colocated actors, abort.
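
The helper that raises this error lives in ray/rllib/utils/actors.py. As a rough illustration (a simplified sketch, not the actual RLlib implementation; the helper name and actor resource request below are made up), the general strategy is to over-provision actors, keep only those that the scheduler happened to place on the desired node, and give up after a few attempts:

import ray

ray.init()

@ray.remote(num_cpus=1)
class Aggregator:
    def node_ip(self):
        return ray.util.get_node_ip_address()

def create_colocated_sketch(actor_cls, count, target_ip, max_attempts=3):
    """Simplified sketch of the colocation strategy (not RLlib's code)."""
    colocated = []
    for _ in range(max_attempts):
        # Over-provision: launch more actors than needed and hope enough
        # of them land on the target node.
        candidates = [actor_cls.remote() for _ in range(2 * count)]
        ips = ray.get([a.node_ip.remote() for a in candidates])
        for actor, ip in zip(candidates, ips):
            if ip == target_ip and len(colocated) < count:
                colocated.append(actor)
            else:
                ray.kill(actor)  # discard actors placed on other nodes
        if len(colocated) >= count:
            return colocated
    raise Exception("Unable to create enough colocated actors, abort.")

# e.g. try to place 5 aggregation workers on the driver's node:
workers = create_colocated_sketch(Aggregator, 5, ray.util.get_node_ip_address())

In a scheme like this, if the target node's CPUs are already claimed by other actors, the surplus actors keep landing elsewhere, the retries run out, and the exception shown in the traceback is raised.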

Versions / Dependencies

Ray 1.11.0 Python 3.8.13 Debian GNU/Linux 10 (buster)

Reproduction script

Cluster config:

# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker-mixed

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 36

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 8.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "us-central1-docker.pkg.dev/<gcp-project-id>/ray-ml/ray-ml:e82623d"
    container_name: "ray_nvidia_docker" # e.g. ray_docker

    # # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"

    run_options: ["-v /dev/shm:/dev/shm"]
    # worker_image: "rayproject/ray-ml:latest"

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 30

# Cloud-provider specific configuration.
provider:
    availability_zone: us-central1-b
    project_id: <gcp-project-id>
    type: gcp
    region: us-central1

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_default:
        # The resources provided by this node type.
        resources: {"CPU": 32}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-highcpu-32
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/<gcp-project-id>/global/images/common-cpu-ray-ml-head-v20220331
            tags:
              items:
                - ray-dashboard

    ray_learner_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 1
        # The resources provided by this node type.
        resources: {"CPU": 48, "GPU": 4}

        #worker_setup_commands:
        #    - sudo mount -o remount,size=909G /dev/shm
        #docker:
        #    worker_run_options: ["--shm-size 900g"]

        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: a2-highgpu-4g
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/<gcp-project-id>/global/images/common-cu113-ray-ml-worker-v20220331
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-a100
                acceleratorCount: 4
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE
            # Un-Comment this to launch workers with the Service Account of the Head Node
            serviceAccounts:
              - email: ray-autoscaler-sa-v1@<gcp-project-id>.iam.gserviceaccount.com
                scopes:
                  - https://www.googleapis.com/auth/cloud-platform

    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 5
        # The resources provided by this node type.
        resources: {"CPU": 8, "GPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert    

        node_config:
            machineType: n1-highmem-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/<gcp-project-id>/global/images/common-cu110-ray-ml-worker-v20220331
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 2
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE
            # Un-Comment this to launch workers with the Service Account of the Head Node
            serviceAccounts:
              - email: ray-autoscaler-sa-v1@<gcp-project-id>.iam.gserviceaccount.com
                scopes:
                  - https://www.googleapis.com/auth/cloud-platform

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    - gcloud auth configure-docker us-central1-docker.pkg.dev -q
    - sudo mount -o remount,size=$(($(grep MemTotal /proc/meminfo | tr -dc '0-9') * 830 / 1073741824))G /dev/shm
    - bash -c $'ps -e | grep apt | awk \'{print $1}\' | xargs tail -f --pid || true' # Wait for auto upgrade that might run in the background.

# List of shell commands to run to set up nodes.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"
    - df -h
#    - pip install torch==1.10.2+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --include-dashboard=true
      --dashboard-port=8265

# Command to start ray on worker nodes. You don't need to change this.

worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
      --object-store-memory $(($(grep MemTotal /proc/meminfo | tr -dc '0-9') * 819))
head_node: {}
worker_nodes: {}

@hemanmach could you please share your IMPALA configuration?

Issue Severity

No response

Related issues

#19299

ghost commented 2 years ago

Here is an IMPALA config that often leads to the above error:

{
    "num_workers": 120,
    "num_aggregation_workers": 5,
    "num_cpus_for_driver": 40,
    "num_gpus": 4,
    "num_gpus_per_worker": 0.3,
    "num_cpus_per_worker": 1,
    "num_envs_per_worker": 1,
    "max_sample_requests_in_flight_per_worker": 4,
    "num_multi_gpu_tower_stacks": 8,
    "remote_worker_envs": False,
    "rollout_fragment_length": 128,
    "train_batch_size": 512,
    "num_sgd_iter": 1,
    "lr": 0.0001,
    "vf_loss_coeff": 0.5,
    "create_env_on_driver": False,
    "learner_queue_size": 128,
    "learner_queue_timeout": 600,
    "no_done_at_end": False,
    "soft_horizon": False,
    "placement_strategy": "PACK",
}
sven1977 commented 2 years ago

Hey @melonipoika and @hemanmach, thanks for posting this issue. Let's count the number of CPUs you have (a) available in total across all machines, and (b) available on your head node (where learning happens on the GPUs and where RLlib will try to create all your aggregation workers, since these have to be co-located with the learner). Your IMPALA config suggests that you require:
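
As a rough illustration of that tally, using the numbers from the IMPALA config and cluster YAML posted above (and assuming one CPU per aggregation worker; this is not sven1977's own list):

# Rough CPU tally for the posted IMPALA config. Aggregation workers must
# be co-located with the learner/driver, so they compete for the learner
# node's CPUs (one CPU per aggregation worker assumed).
num_workers = 120                # rollout workers
num_cpus_per_worker = 1
num_cpus_for_driver = 40
num_aggregation_workers = 5

learner_node_cpus = 48           # ray_learner_gpu: {"CPU": 48, "GPU": 4}

rollout_cpus = num_workers * num_cpus_per_worker                      # 120, cluster-wide
learner_node_demand = num_cpus_for_driver + num_aggregation_workers   # 45

print(f"rollout CPUs needed cluster-wide: {rollout_cpus}")
print(f"learner-node CPUs needed: {learner_node_demand} of {learner_node_cpus}")

On paper, 45 of the learner node's 48 CPUs are spoken for by the driver and the aggregation workers, which leaves very little headroom if any other actors end up scheduled on that node.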

ghost commented 2 years ago

Hi @sven1977, thanks for looking into this! We are using the "ray_learner" node type for the trainer process. It is the only node type with 4 GPUs, so we assumed that the learning would be forced to happen there. The Ray Dashboard also showed the trainer process under that node when the job was running. This node has 48 cpus in total.

sven1977 commented 2 years ago

Hey @hemanmach , that's correct, the learning will happen on that node with the 4 GPUs, and it does seem to have enough CPUs to place the aggregation workers there as well. Ok, let us try to reproduce this issue ...

melonipoika commented 2 years ago

Thanks @sven1977. Please let me know if I can help reproduce this issue. I am in your timezone :-)

sven1977 commented 2 years ago

Ok, I brought up an AWS cluster with 4 GPUs on the head node and up to 10 CPU-only worker machines (using autoscaling). The following config worked fine and the job started running (and learned the task). The only difference between my config below and yours is the num_gpus_per_worker setting (0 here instead of 0.3). I'll confirm this now with a GPU-worker cluster.

pong-impala:
    env: PongNoFrameskip-v4
    run: IMPALA
    config:
        num_workers: 120
        num_aggregation_workers: 5
        num_cpus_for_driver: 40
        num_gpus: 4
        num_gpus_per_worker: 0 # <--- HERE -- only difference: 0.3
        num_cpus_per_worker: 1
        num_envs_per_worker: 1
        max_sample_requests_in_flight_per_worker: 4
        num_multi_gpu_tower_stacks: 8
        remote_worker_envs: False
        rollout_fragment_length: 128
        train_batch_size: 512
        num_sgd_iter: 1
        lr: 0.0001
        vf_loss_coeff: 0.5
        create_env_on_driver: False
        learner_queue_size: 128
        learner_queue_timeout: 600
        no_done_at_end: False
        soft_horizon: False
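
For reference, a tuned-example YAML like the one above can be run with `rllib train -f <file>.yaml`; an equivalent Python entry point on Ray 1.x would look roughly like the following sketch (only a subset of the keys is repeated, and stop criteria and checkpointing are omitted):

# Rough Python (Ray 1.x) equivalent of the tuned YAML example above;
# only a few of the keys are repeated here.
import ray
from ray import tune

ray.init(address="auto")  # connect to the already running cluster

tune.run(
    "IMPALA",
    name="pong-impala",
    config={
        "env": "PongNoFrameskip-v4",
        "num_workers": 120,
        "num_aggregation_workers": 5,
        "num_cpus_for_driver": 40,
        "num_gpus": 4,
        "num_gpus_per_worker": 0,  # the setting changed from 0.3
        "num_cpus_per_worker": 1,
        "rollout_fragment_length": 128,
        "train_batch_size": 512,
    },
)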
sven1977 commented 2 years ago

I do not see the `Exception: Unable to create enough colocated actors, abort.` message. I'm using a g3.16xlarge head node and up to 10 m5.4xlarge worker nodes.

I will try to swap the worker nodes out for GPU machines and see what happens with num_gpus_per_worker=0.3.
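
For context on why that setting matters, here is a back-of-the-envelope GPU tally for the originally reported setup (illustrative only, using the reporter's numbers; not a diagnosis of the failure):

# GPU demand implied by num_gpus_per_worker=0.3 (numbers taken from the
# reported config and the issue description).
num_workers = 120
num_gpus_per_worker = 0.3
learner_gpus = 4

rollout_gpu_demand = num_workers * num_gpus_per_worker   # 36.0
total_gpu_demand = rollout_gpu_demand + learner_gpus     # 40.0

# 20 rollout nodes with 2 T4s each, plus the 4-GPU learner node:
available_gpus = 20 * 2 + 4                              # 44

print(rollout_gpu_demand, total_gpu_demand, available_gpus)

With a fractional GPU request, every rollout worker must be scheduled on a node that has a GPU, so the demand of roughly 40 GPUs is tight against the 44 available, whereas with num_gpus_per_worker=0 the rollout workers can be placed on any CPU node.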

ghost commented 2 years ago

Hi @sven1977, could you try using 20 nodes for rollout generation? I found that with 20 nodes and 5 aggregation workers, I hit that exception much more often, roughly more than two-thirds of the time.

sven1977 commented 2 years ago

I also filed an issue about the autoscaler w/ GPU problem that I'm seeing: https://github.com/ray-project/ray/issues/24428