ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.25k stars 5.62k forks source link

[scheduling] Default actor lifetime resources (0 CPUs) cause cluster not to be saturated #6814

Open edoakes opened 4 years ago

edoakes commented 4 years ago

What is the problem?

Scheduling with the default of 0 lifetime CPUs for actors seems to cause issues with cluster utilization.

If you run the below script that should consume all CPUs in the cluster with one actor per CPU without specifying resources for TestActor, the cluster resources are never saturated:

Screen Shot 2020-01-16 at 6 18 02 PM

However, if you set num_cpus=1, then all resources are saturated as expected:

Screen Shot 2020-01-16 at 6 17 15 PM

Reproduction (REQUIRED)

script:

import time

import ray
ray.init(address="auto")

@ray.remote
class TestActor(object):
    def spin(self, timeout):
        start = time.time()
        while time.time() - start < timeout:
            pass

cpus_in_cluster = int(ray.state.cluster_resources()["CPU"])
actors = [TestActor.remote() for _ in range(cpus_in_cluster)]

while True:
    ray.get([actor.spin.remote(1) for actor in actors])

cluster.yaml:

# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 9

# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 9

# The initial number of worker nodes to launch in addition to the head
# node. When the cluster is first brought up (or when it is refreshed with a
# subsequent `ray up`) this number of nodes will be started.
initial_workers: 9

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
    InstanceType: m4.4xlarge
    ImageId: ami-06d51e91cea0dac8d # Ubuntu 18.04

    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 20

    # Additional options in the boto docs.

# Provider-specific config for worker nodes, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
worker_nodes:
    InstanceType: m4.4xlarge
    ImageId: ami-06d51e91cea0dac8d # Ubuntu 18.04

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        # Additional options can be found in the boto docs, e.g.
        #   SpotOptions:
        #       MaxPrice: MAX_HOURLY_PRICE

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - sleep 30
    - sudo apt-get -qq update
    - sudo apt-get install -y build-essential curl unzip
    # Install Anaconda.
    - wget --quiet https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
    - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
    - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
    # Install Ray.
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
    - pip install ray[dashboard]

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
edoakes commented 4 years ago

cc @istoica @ericl