melonipoika opened this issue 2 years ago
Here is an IMPALA config that often leads to the above error:
{
"num_workers": 120,
"num_aggregation_workers": 5,
"num_cpus_for_driver": 40,
"num_gpus": 4,
"num_gpus_per_worker": 0.3,
"num_cpus_per_worker": 1,
"num_envs_per_worker": 1,
"max_sample_requests_in_flight_per_worker": 4,
"num_multi_gpu_tower_stacks": 8,
"remote_worker_envs": False,
"rollout_fragment_length": 128,
"train_batch_size": 512,
"num_sgd_iter" : 1,
"lr": 0.0001,
"vf_loss_coeff": 0.5,
"create_env_on_driver": False,
"learner_queue_size": 128,
"learner_queue_timeout": 600,
"no_done_at_end": False,
"soft_horizon": False,
"placement_strategy": "PACK",
}
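For context, here is a minimal sketch (my own illustration, not from the report; the env id and iteration count are assumptions) of how such a dict is typically handed to RLlib's IMPALA trainer on Ray 1.x:

import ray
from ray.rllib.agents.impala import ImpalaTrainer

config = {
    "env": "PongNoFrameskip-v4",  # assumed env; not named in the report at this point
    "num_workers": 120,
    "num_aggregation_workers": 5,
    "num_cpus_for_driver": 40,
    "num_gpus": 4,
    "num_gpus_per_worker": 0.3,
    "num_cpus_per_worker": 1,
    # ... remaining keys exactly as in the dict above ...
}

ray.init(address="auto")            # attach to the already-running cluster
trainer = ImpalaTrainer(config=config)
for _ in range(10):                 # arbitrary number of training iterations
    result = trainer.train()
    print(result["episode_reward_mean"])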
Hey @melonipoika and @hemanmach, thanks for posting this issue. Let's try to count the number of CPUs you have a) available in total on all machines, and b) available on your head node (where learning happens on the GPUs and where RLlib will try to create all your aggregation workers, as these have to be co-located with the learner). Your IMPALA config suggests that you require:
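(The exact breakdown is not preserved in this thread; the following is my own accounting from the config above, assuming one CPU per aggregation worker.)

num_workers = 120
num_aggregation_workers = 5
num_cpus_for_driver = 40
num_cpus_per_worker = 1
num_gpus = 4
num_gpus_per_worker = 0.3

# a) CPUs needed across the whole cluster
total_cpus = num_cpus_for_driver + num_aggregation_workers + num_workers * num_cpus_per_worker
print(total_cpus)      # 165

# b) CPUs needed on the head node alone: the driver plus the aggregation
#    workers, which have to sit next to the learner
head_node_cpus = num_cpus_for_driver + num_aggregation_workers
print(head_node_cpus)  # 45

# GPUs: 4 for the learner plus a fractional GPU per rollout worker
total_gpus = num_gpus + num_workers * num_gpus_per_worker
print(total_gpus)      # 40.0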
Hi @sven1977, thanks for looking into this! We are using the "ray_learner" node type for the trainer process. It is the only node type with 4 GPUs, so we assumed that learning would be forced to happen there. The Ray Dashboard also showed the trainer process under that node while the job was running. This node has 48 CPUs in total.
Hey @hemanmach , that's correct, the learning will happen on that node with the 4 GPUs, and it does seem to have enough CPUs to place the aggregation workers there as well. Ok, let us try to reproduce this issue ...
Thanks @sven1977. Please let me know if I can help reproduce this issue. I am in your timezone :-)
Ok, I brought up an AWS cluster with 4 GPUs on the head node and up to 10 CPU-only worker machines (using auto-scaling). The following config worked fine and the job started running (and learned the task). The only difference between my config below and yours is the num_gpus_per_worker setting (0 instead of 0.3). I'll confirm this now with a GPU-worker cluster.
pong-impala:
env: PongNoFrameskip-v4
run: IMPALA
config:
num_workers: 120
num_aggregation_workers: 5
num_cpus_for_driver: 40
num_gpus: 4
num_gpus_per_worker: 0 # <--- HERE -- only difference: 0.3
num_cpus_per_worker: 1
num_envs_per_worker: 1
max_sample_requests_in_flight_per_worker: 4
num_multi_gpu_tower_stacks: 8
remote_worker_envs: False
rollout_fragment_length: 128
train_batch_size: 512
num_sgd_iter: 1
lr: 0.0001
vf_loss_coeff: 0.5
create_env_on_driver: False
learner_queue_size: 128
learner_queue_timeout: 600
no_done_at_end: False
soft_horizon: False
I do not see the Exception: Unable to create enough colocated actors, abort. message. I'm using a g3.16xlarge head node and up to 10 m5.4xlarge worker nodes. I will now try swapping the worker nodes for GPU machines and see what happens with num_gpus_per_worker=0.3.
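For completeness: a tuned-example YAML like the one above is normally started with the rllib CLI (rllib train -f pong-impala.yaml). A rough Python equivalent, assuming the YAML is saved as pong-impala.yaml on the head node (my own sketch, not part of the original thread):

import yaml
import ray
from ray import tune

with open("pong-impala.yaml") as f:
    spec = yaml.safe_load(f)["pong-impala"]

# The rllib CLI folds the top-level `env` key into the trainer config
config = dict(spec["config"], env=spec["env"])

ray.init(address="auto")
tune.run(spec["run"], name="pong-impala", config=config)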
Hi @sven1977, could you try using 20 nodes for rollout generation? I found that with 20 nodes and 5 aggregation workers it was much more likely that I got that exception, roughly more than two thirds of the time.
I also filed an issue about the autoscaler w/ GPU problem that I'm seeing: https://github.com/ray-project/ray/issues/24428
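To illustrate why co-location can fail even when the cluster as a whole has plenty of CPUs, here is a standalone sketch using Ray placement groups (this is only an analogy, not RLlib's actual internal mechanism for aggregation workers):

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Ask for one node that can host the 40-CPU driver/learner bundle plus five
# 1-CPU aggregation-worker bundles at the same time (STRICT_PACK = same node).
pg = placement_group([{"CPU": 40}] + [{"CPU": 1}] * 5, strategy="STRICT_PACK")

if not pg.wait(timeout_seconds=60):
    # If no single node has 45 free CPUs -- for example because rollout workers
    # were scheduled onto the head node first -- the request cannot be
    # satisfied, which is the same situation behind
    # "Unable to create enough colocated actors, abort."
    print("No single node has enough free CPUs for learner + aggregation workers")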
What happened + What you expected to happen
When running on a GCP cluster with 21 machines (1 learner, 20 rollout generators) and using 5 or more aggregation workers, IMPALA tends to error out with Exception: Unable to create enough colocated actors, abort.
Full traceback:
Versions / Dependencies
Ray 1.11.0, Python 3.8.13, Debian GNU/Linux 10 (buster)
Reproduction script
Cluster config:
@hemanmach could you please share your IMPALA configuration?
Issue Severity
No response
Related issues
#19299