Closed: AmeerHajAli closed this issue 3 years ago.
CC @sven1977 @richardliaw @javi-redondo
I'll mark this as a release blocker for now, though @krfricke please feel free to deprio once you've narrowed the cause!
I'll use this issue as a log.
With a simple example, autoscaling seems to work fine:
```python
import time

import ray
from ray import tune


def train(config):
    time.sleep(30)
    return 4


ray.init(address="auto")

tune.run(
    train,
    resources_per_trial=tune.PlacementGroupFactory([{"CPU": 2}] * 4))
```
Output:
```
(autoscaler +5s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +5s) Adding 3 nodes of type cpu_2_spot.
```
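As a side note on why exactly 3 nodes come up: the placement group asks for 4 bundles of 2 CPUs, and assuming the head node's 2 CPUs are usable by the group, 6 more CPUs are needed. A quick back-of-the-envelope check (my reading, not from the issue itself):

```python
# Rough sanity check of the resource demand above, not part of the repro itself.
bundles = [{"CPU": 2}] * 4                     # PlacementGroupFactory bundles per trial
cpus_needed = sum(b["CPU"] for b in bundles)   # 8 CPUs for the single trial
head_cpus = 2                                  # cpu_2_ondemand head node
spot_node_cpus = 2                             # cpu_2_spot workers
extra_nodes = -(-(cpus_needed - head_cpus) // spot_node_cpus)  # ceil(6 / 2)
print(extra_nodes)                             # 3, matching "Adding 3 nodes of type cpu_2_spot"
```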
(cluster.yaml)
```yaml
cluster_name: ray-tune-autoscaling-test
max_workers: 20
upscaling_speed: 20
idle_timeout_minutes: 0

docker:
    image: rayproject/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_2_ondemand:
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 0
        max_workers: 0
    cpu_2_spot:
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot
        resources: {"CPU": 2}
        min_workers: 0
        max_workers: 20

auth:
    ssh_user: ubuntu

head_node_type: cpu_2_ondemand
worker_default_node_type: cpu_2_spot

file_mounts: {
    "/autoscaling": "./"
}
```
A simple PPO run on CartPole works as well:
```python
import ray
from ray import tune

config = {
    "env": "CartPole-v0",
    "num_workers": 10
}

ray.init(address="auto")

tune.run(
    "PPO",
    stop={"training_iteration": 10_000},
    config=config)
```
```
(autoscaler +3s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +3s) Adding 5 nodes of type cpu_2_spot.
```
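The 5 nodes also line up with RLlib's defaults, if I have them right (1 CPU for the PPO trainer plus 1 CPU per rollout worker; that default is my assumption, not stated in the issue):

```python
import math

# Rough check, assuming 1 CPU for the PPO trainer and 1 CPU per rollout worker.
cpus_needed = 1 + 10                  # trainer + num_workers = 11
extra_cpus = cpus_needed - 2          # minus the 2-CPU head node
print(math.ceil(extra_cpus / 2))      # 5 two-CPU spot nodes, as in the log above
```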
So the culprit was actually just the `ray.init()` call, which did not connect to the existing Ray instance but started a new (non-autoscaling) cluster instead. Changing this to `ray.init(address="auto")` solves the problem:
```
(autoscaler +8s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +8s) Adding 49 nodes of type ray.worker.default.
```
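For anyone landing here with the same symptom, the change amounts to the snippet below (a minimal sketch against the examples above, not a full reproduction):

```python
import ray

# ray.init()               # starts a fresh, non-autoscaling Ray instance, so the
#                          # cluster started by `ray up` never sees the trials'
#                          # resource demand
ray.init(address="auto")   # connects to the existing cluster instead
```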
I'll close this issue for now. Let me know if the problem persists with the change.
By the way, shouldn't `ray.init()` automatically connect to an existing cluster?
I am using `tune.run()` on the `rayproject/ray-ml:nightly-gpu` docker image. I call `ray up config.yaml -y`, then `ray attach config.yaml`, then `python unity3d_env_local.py`. The cluster never scales up workers and I get the following output:

my code (inside the `unity3d_env_local.py` file):
my cluster yaml:
The file mounts include the `unity3d_env_local.py` pasted above and the following file, unzipped locally inside a `tennis` directory: https://drive.google.com/file/d/1qBn_T0ukNj-w6ggza_xl1mXqVhYQjHyK/view?usp=sharing

I tried to use placement groups like this and it works, so I am not sure what is going on:
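(The snippet referenced here isn't preserved in the thread; the sketch below is only my guess at its general shape, with a hypothetical `my_trainable` standing in for the real one.)

```python
from ray import tune


def my_trainable(config):
    # Hypothetical placeholder; the poster's actual trainable isn't shown above.
    tune.report(score=0)


tune.run(
    my_trainable,
    # Request the trial's resources as placement group bundles, the same
    # pattern as the first example in this thread.
    resources_per_trial=tune.PlacementGroupFactory([{"CPU": 2}] * 2))
```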
We have a blog post waiting on this working; can you please help?