Closed floepfl closed 3 years ago
@floepfl can you try Ray==1.4?
@floepfl, could you also try lowering your num_workers by 1 or 2? It's probably because we now dedicate a CPU to the learner (local worker) via the placement groups, which we didn't do in <= 1.2. We also now make sure that each replay buffer shard gets its own CPU (something we didn't guarantee prior to 1.3). As a result, the number of CPUs on your machine may no longer be sufficient, and old configs that used to run fine now leave the trial pending.
You can print out the required resources with:
print(APEXTrainer.default_resource_request(config=config)._bundles)
The default behavior is this:
[
{'CPU': 5, 'GPU': 1}, # <- learner (1 CPU + 1GPU for learner; 4 CPUs for the replay shards (see config.optimizer.num_replay_shards))
# 32 workers (1 CPU each)
{'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}]
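A quick way to sanity-check whether a machine can satisfy these bundles is to total them up; a minimal sketch, using the default Ape-X bundle list shown above (one learner bundle plus 32 one-CPU worker bundles):

```python
# Sketch: sum the placement-group bundles printed above to see how many
# CPUs/GPUs the trial needs before it can leave PENDING.
bundles = [{"CPU": 5, "GPU": 1}] + [{"CPU": 1, "GPU": 0}] * 32

total_cpus = sum(b.get("CPU", 0) for b in bundles)
total_gpus = sum(b.get("GPU", 0) for b in bundles)

print(f"Trial needs {total_cpus} CPUs and {total_gpus} GPUs")  # 37 CPUs, 1 GPU
# If ray.init(num_cpus=...) provides fewer than this, the trial stays PENDING.
```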
Thanks for your answers. Indeed, accounting for the replay buffer shards in the number of CPUs I made available in ray.init resolved the issue.
Hey everyone,
I encountered a similar issue today, but with QMIX. I have tried Ray 2.0.0.dev0, 1.5.0, and 1.4.0 with no success; the trial was always stuck in "PENDING" status no matter how I adjusted the number of workers and CPUs. I ran the trial overnight and it was still stuck at pending.
Then I saw this issue thread and lowered my Ray version to 1.2.0; however, a new error popped up:
2021-08-04 13:21:32,756 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
File "/data/[USER]/code/mate-nips/rllib_train.py", line 87, in <module>
verbose=3)
File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
runner.step()
File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 404, in step
self.trial_executor.on_no_available_trials(self)
File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/trial_executor.py", line 186, in on_no_available_trials
"Insufficient cluster resources to launch trial: "
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 17.0 CPUs, 0.0 GPUs, but the cluster has only 16 CPUs, 0 GPUs, 77.78 GiB heap, 25.73 GiB objects (1.0 node:162.105.162.205, 1.0 accelerator_type:X).
You can adjust the resource requests of RLlib agents by setting num_workers, num_gpus, and other configs. See the DEFAULT_CONFIG defined by each agent for more info.
The config of this agent is: {'env': 'mate_centralized', 'log_level': 'WARN', 'num_workers': 16, 'num_cpus_per_worker': 1.0, 'num_gpus': 0, 'num_gpus_per_worker': 0.0, 'framework': 'torch', 'exploration_config': {'type': 'EpsilonGreedy', 'initial_epsilon': 1.0, 'final_epsilon': 0.02, 'epsilon_timesteps': 10000}, 'horizon': None, 'callbacks': <class 'module.AuxiliaryRewardCallbacks'>}
Interestingly, the trial always requests 1 CPU more than the amount I made available to Ray. If I set the resources to 4 CPUs, it asks for 5; if I set 1 CPU, it asks for 2. I only saw this error after downgrading all the way to 1.2.0; higher versions never raised it.
I haven't tested whether this problem is specific to the QMIX algorithm. I used a custom Gym environment, Python 3.7, PyTorch 1.4, and Gym 0.18.3.
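The "always 1 CPU more" pattern matches the driver accounting described earlier in the thread: the trainer's local worker needs a CPU in addition to the rollout workers. A sketch of that arithmetic (the helper name is illustrative, not an RLlib API):

```python
# Sketch of the accounting behind "trial requested 17.0 CPUs" on a 16-CPU
# machine: one CPU for the trainer's local worker (driver) plus one CPU per
# rollout worker.
def cpus_requested(num_workers, num_cpus_per_worker=1, driver_cpus=1):
    return driver_cpus + num_workers * num_cpus_per_worker

print(cpus_requested(16))  # 17 -> exceeds a 16-CPU machine
print(cpus_requested(4))   # 5, matching the "set 4, asks for 5" observation
print(cpus_requested(1))   # 2, matching the "set 1, asks for 2" observation
```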
Later I figured out it was a genuine mistake on my part. I thought I could fractionally divide the number of CPUs or GPUs by the number of workers:
"num_cpus_per_worker": args.num_cpus / args.num_workers,
"num_gpus_per_worker": args.num_gpus / args.num_workers,
I am not sure the division itself is at fault, because both true division (/) and integer division (//) cause the same error. The config actually shows the correct number (num_of_cpus_per_agent = 1), but Ray automatically rounds it up to 2 per agent, which causes the 6-CPU requirement shown in the screenshot. Other than that I don't have a clear explanation. After deleting those two lines from my config file, everything works just fine.
Hi @mickelliu, is this still an issue for you? If it is, can you provide a minimal set up for me to try? Thanks!
Hi @xwjiang2010, although I resolved this issue by setting up my config correctly, I do believe that the problem (being stuck at pending) still persists in Ray versions > 1.2.
To reproduce the issue I had before, you could downgrade your Ray version to 1.2 and add these lines to the RLlib algorithm config dict (which is passed into tune.run()). I ran my code with a closed-source custom Gym environment that was then registered in ray.tune.
config = {
'num_workers': args.num_workers,
"num_gpus": args.num_gpus,
# The below two lines caused bugs
# "num_cpus_per_worker": args.num_cpus // args.num_workers,
# "num_gpus_per_worker": args.num_gpus // args.num_workers,
}
In my case, the divisions don't seem to cause the error, but specifying the parameters "num_cpus_per_worker" and "num_gpus_per_worker" does.
Yes. After removing num_gpus_per_worker, the pending issue is fixed.
I have a similar issue: with a basic Policy Gradient configuration, RLlib fails to start the job.
from ray.rllib.agents.pg.pg import (DEFAULT_CONFIG, PGTrainer as trainer)
Even though I have 12 CPU cores, I have tried setting:
config_update = {
"env": args.env,
"num_gpus": 1,
"num_workers": 10,
"evaluation_num_workers": 4,
"evaluation_interval": 1,
}
Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/31.34 GiB heap, 0.0/15.67 GiB objects
Still no joy, stuck at pending!
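One likely explanation for this particular config, sketched under the same driver-plus-workers accounting discussed earlier in the thread (the helper is illustrative, not an RLlib API): evaluation workers need CPUs too, so the total request can exceed 12 cores.

```python
# Sketch: one driver CPU + one CPU per rollout worker + one CPU per
# evaluation worker (assuming the default of 1 CPU each).
def trial_cpus(num_workers, evaluation_num_workers=0):
    return 1 + num_workers + evaluation_num_workers

needed = trial_cpus(num_workers=10, evaluation_num_workers=4)
print(needed)        # 15
print(needed <= 12)  # False -> the trial stays PENDING on a 12-CPU machine
```

Lowering num_workers or evaluation_num_workers so the total fits within the 12 available cores would be the first thing to try.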
Hey Jules, what version are you using?
Hello, I was using Ray version 1.6.0 on Microsoft Windows 10 with Anaconda Python 3.7.1 when I had this problem, but I cannot remember in which order I ran pip install ray[rllib].
I did get a PPO Ray job working with "framework": "torch" under another Anaconda Python 3.8.1 conda environment.
I think ScalingConfig should have an argument to set evaluation_num_workers and count it toward the total number of workers, for clarity; alternatively, the total number of workers should be capped at the value set in ScalingConfig, because right now any additional workers set in config={} push the count past it.
You can print out the needed resources by doing a:
print(APEXTrainer.default_resource_request(config=config)._bundles)
Do we add this print command within the function 'main'?
Hey everyone,
I'm trying to run Ape-X with tune.run() on Ray 1.3.0, and the status remains "PENDING". I get the same message indefinitely:
== Status ==
Memory usage on this node: 7.5/19.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/8.24 GiB heap, 0.0/4.12 GiB objects
Result logdir: /home/flo/ray_results/APEX
Number of trials: 1/1 (1 PENDING)
+---------------------------+----------+-------+
| Trial name                | status   | loc   |
|---------------------------+----------+-------|
| APEX_PFCAsset_985a1_00000 | PENDING  |       |
+---------------------------+----------+-------+
If I use the debug flag, it also outputs the following (many times over):
2021-06-10 09:22:33,760 DEBUG trial_runner.py:621 -- Running trial APEX_PFCAsset_985a1_00000
2021-06-10 09:22:33,760 DEBUG trial_executor.py:43 -- Trial APEX_PFCAsset_985a1_00000: Status PENDING unchanged.
2021-06-10 09:22:33,761 DEBUG trial_executor.py:62 -- Trial APEX_PFCAsset_985a1_00000: Saving trial metadata.
Downgrading to 1.2.0 solves the problem. I'm seeing this on both Linux and Windows. I also tried the latest 2.0.0 wheel from the website, as well as version 1.4.0, and got the same issue. I experienced it with A3C too, and another user on Slack reports experiencing it with PPO; according to him, the problem could lie in the resource allocation.
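This report fits the explanation given later in the thread: the status line above shows only 4 CPUs available, while the default Ape-X resource request (printed earlier with default_resource_request) is far larger, so the placement group can never be satisfied. A sketch of the mismatch, assuming the default bundles shown earlier:

```python
# Sketch: compare the default Ape-X placement-group request (one
# {'CPU': 5, 'GPU': 1} learner bundle plus 32 one-CPU worker bundles)
# against the 4 CPUs reported in the status output above.
apex_default_bundles = [{"CPU": 5, "GPU": 1}] + [{"CPU": 1, "GPU": 0}] * 32

needed = sum(b["CPU"] for b in apex_default_bundles)
print(needed)       # 37
print(needed <= 4)  # False -> PENDING forever unless num_workers (and
                    # config.optimizer.num_replay_buffer_shards) are lowered
```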