melonipoika commented 2 years ago

What happened + What you expected to happen

When running on a GCP cluster with 21 machines (1 learner, 20 rollout generators) and using 5 aggregation workers (or more), Impala tends to error out with:

File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/utils/actors.py", line 106, in create_colocated

    raise Exception("Unable to create enough colocated actors, abort.")

Exception: Unable to create enough colocated actors, abort.

Full traceback:

[36mray::CustomImpalaTrainer.__init__()[39m (pid=34990, ip=10.128.0.18)

  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 729, in __init__

    sync_function_tpl)

  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 122, in __init__

    self.setup(copy.deepcopy(self.config))

  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 845, in setup

    **self._kwargs_for_execution_plan())

  File "/tmp/ray/session_2022-03-31_14-10-28_247346_1994/runtime_resources/py_modules_files/_ray_pkg_a66bee062b2f3dd6/customImpala/customImpala.py", line 1000, in execution_plan

    workers, config)

  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/execution/tree_agg.py", line 97, in gather_experiences_tree_aggregation

    create_colocated(Aggregator, [config, g], 1)[0] for g in rollout_groups

  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/execution/tree_agg.py", line 97, in <listcomp>

    create_colocated(Aggregator, [config, g], 1)[0] for g in rollout_groups

  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/utils/actors.py", line 106, in create_colocated

    raise Exception("Unable to create enough colocated actors, abort.")

Exception: Unable to create enough colocated actors, abort.

Versions / Dependencies

Ray 1.11.0 Python 3.8.13 Debian GNU/Linux 10 (buster)

Reproduction script

Cluster config:

# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker-mixed

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 36

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 8.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "us-central1-docker.pkg.dev/<gcp-project-id>/ray-ml/ray-ml:e82623d"
    container_name: "ray_nvidia_docker" # e.g. ray_docker

    # # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"

    run_options: ["-v /dev/shm:/dev/shm"]
    # worker_image: "rayproject/ray-ml:latest"

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 30

# Cloud-provider specific configuration.
provider:
    availability_zone: us-central1-b
    project_id: <gcp-project-id>
    type: gcp
    region: us-central1

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_default:
        # The resources provided by this node type.
        resources: {"CPU": 32}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-highcpu-32
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/<gcp-project-id>/global/images/common-cpu-ray-ml-head-v20220331
            tags:
              items:
                - ray-dashboard

    ray_learner_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of workers nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 1
        # The resources provided by this node type.
        resources: {"CPU": 48, "GPU": 4}

        #worker_setup_commands:
        #    - sudo mount -o remount,size=909G /dev/shm
        #docker:
        #    worker_run_options: ["--shm-size 900g"]

        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: a2-highgpu-4g
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/<gcp-project-id>/global/images/common-cu113-ray-ml-worker-v20220331
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-a100
                acceleratorCount: 4
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemtible instance by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE
            # Un-Comment this to launch workers with the Service Account of the Head Node
            serviceAccounts:
              - email: ray-autoscaler-sa-v1@<gcp-project-id>.iam.gserviceaccount.com
                scopes:
                  - https://www.googleapis.com/auth/cloud-platform

    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of workers nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 5
        # The resources provided by this node type.
        resources: {"CPU": 8, "GPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert    

        node_config:
            machineType: n1-highmem-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/<gcp-project-id>/global/images/common-cu110-ray-ml-worker-v20220331
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 2
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemtible instance by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE
            # Un-Comment this to launch workers with the Service Account of the Head Node
            serviceAccounts:
              - email: ray-autoscaler-sa-v1@<gcp-project-id>.iam.gserviceaccount.com
                scopes:
                  - https://www.googleapis.com/auth/cloud-platform

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    - gcloud auth configure-docker us-central1-docker.pkg.dev -q
    - sudo mount -o remount,size=$(($(grep MemTotal /proc/meminfo | tr -dc '0-9') * 830 / 1073741824))G /dev/shm
    - bash -c $'ps -e | grep apt | awk \'{print $1}\' | xargs tail -f --pid || true' # Wait for auto upgrade that might run in the background.

# List of shell commands to run to set up nodes.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"
    - df -h
#    - pip install torch==1.10.2+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --include-dashboard=true
      --dashboard-port=8265

# Command to start ray on worker nodes. You don't need to change this.

worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
      --object-store-memory $(($(grep MemTotal /proc/meminfo | tr -dc '0-9') * 819))
head_node: {}
worker_nodes: {}

@hemanmach could you please share your IMPALA configuration?

Issue Severity

No response

Related issues

19299

ghost commented 2 years ago

Here is a Impala config that oftentimes leads to the above error:

{
            "num_workers": 120,
            "num_aggregation_workers": 5,
            "num_cpus_for_driver": 40,
            "num_gpus": 4,
            "num_gpus_per_worker": 0.3,
            "num_cpus_per_worker": 1,
            "num_envs_per_worker": 1,
            "max_sample_requests_in_flight_per_worker": 4,
            "num_multi_gpu_tower_stacks": 8,
            "remote_worker_envs": False,
            "rollout_fragment_length": 128,
            "train_batch_size": 512,
            "num_sgd_iter" : 1,
            "lr": 0.0001,
            "vf_loss_coeff": 0.5,
            "create_env_on_driver": False,
            "learner_queue_size": 128,
            "learner_queue_timeout": 600,
            "no_done_at_end": False,
            "soft_horizon": False,
            "placement_strategy": "PACK",
        }