What is the problem?

Scheduling actors with the default lifetime requirement of 0 CPUs seems to hurt cluster utilization. If you run the script below, which should consume every CPU in the cluster with one actor per CPU, without specifying resources for `TestActor`, the cluster resources are never saturated. However, if you set `num_cpus=1` on the actor, all resources are saturated as expected.
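For reference, the saturated case differs from the reproduction script below only in the actor decorator; a minimal sketch of that variant (everything else stays the same):

```python
import time

import ray


# Variant that saturates the cluster as expected: the actor declares a 1-CPU
# lifetime requirement instead of the default of 0. The rest of the
# reproduction script below is unchanged.
@ray.remote(num_cpus=1)
class TestActor(object):
    def spin(self, timeout):
        start = time.time()
        while time.time() - start < timeout:
            pass
```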
Reproduction (REQUIRED)
script:
```python
import time

import ray

ray.init(address="auto")


@ray.remote
class TestActor(object):
    def spin(self, timeout):
        start = time.time()
        while time.time() - start < timeout:
            pass


cpus_in_cluster = int(ray.state.cluster_resources()["CPU"])
actors = [TestActor.remote() for _ in range(cpus_in_cluster)]

while True:
    ray.get([actor.spin.remote(1) for actor in actors])
```
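Not part of the original report, but possibly useful for diagnosing the underutilization: the same actor pattern can report which host each actor landed on, which shows how the 0-CPU actors are spread (or not) across the nodes. A rough sketch, using only the standard library for the hostname lookup:

```python
# Diagnostic sketch (not in the original report): count how many of the
# default 0-CPU actors end up on each node.
import socket
from collections import Counter

import ray

ray.init(address="auto")


@ray.remote
class TestActor(object):
    def host(self):
        # EC2 hostnames are unique per node, so this is enough to group actors.
        return socket.gethostname()


cpus_in_cluster = int(ray.state.cluster_resources()["CPU"])
actors = [TestActor.remote() for _ in range(cpus_in_cluster)]
print(Counter(ray.get([a.host.remote() for a in actors])))
```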
`cluster.yaml`:
```yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 9

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 9

# The initial number of worker nodes to launch in addition to the head
# node. When the cluster is first brought up (or when it is refreshed with a
# subsequent `ray up`) this number of nodes will be started.
initial_workers: 9

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
    InstanceType: m4.4xlarge
    ImageId: ami-06d51e91cea0dac8d  # Ubuntu 18.04

    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 20

    # Additional options in the boto docs.

# Provider-specific config for worker nodes, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
worker_nodes:
    InstanceType: m4.4xlarge
    ImageId: ami-06d51e91cea0dac8d  # Ubuntu 18.04

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        # Additional options can be found in the boto docs, e.g.
        #   SpotOptions:
        #       MaxPrice: MAX_HOURLY_PRICE

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - sleep 30
    - sudo apt-get -qq update
    - sudo apt-get install -y build-essential curl unzip
    # Install Anaconda.
    - wget --quiet https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
    - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
    - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
    # Install Ray.
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
    - pip install ray[dashboard]

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
```
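As a rough sanity check (these numbers are inferred from the instance type, not measured in the report): with a head node plus 9 workers, all m4.4xlarge with 16 vCPUs each, the reproduction script should see 160 CPUs and therefore create 160 actors once every node has joined. Something like the following sketch can be run first so that `cpus_in_cluster` reflects the full cluster rather than a partially started one:

```python
# Optional sanity check (not in the original report): wait until the
# autoscaler has brought up all nodes before measuring saturation.
import time

import ray

ray.init(address="auto")

expected_cpus = 10 * 16  # head + 9 workers, 16 vCPUs per m4.4xlarge (assumed)
while ray.state.cluster_resources().get("CPU", 0) < expected_cpus:
    time.sleep(5)
print("Cluster ready with", int(ray.state.cluster_resources()["CPU"]), "CPUs")
```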
- [x] I have verified my script runs in a clean environment and reproduces the issue.
- [x] I have verified the issue also occurs with the latest wheels.