ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib][tune] Training stuck in "Pending" status #16425

Closed · floepfl closed this issue 3 years ago

floepfl commented 3 years ago

Hey everyone,

I'm trying to run Ape-X with tune.run() on Ray 1.3.0, and the status remains "PENDING". I get the same message indefinitely:

== Status ==
Memory usage on this node: 7.5/19.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/8.24 GiB heap, 0.0/4.12 GiB objects
Result logdir: /home/flo/ray_results/APEX
Number of trials: 1/1 (1 PENDING)
+---------------------------+----------+-------+
| Trial name                | status   | loc   |
|---------------------------+----------+-------|
| APEX_PFCAsset_985a1_00000 | PENDING  |       |
+---------------------------+----------+-------+

If I use the debug flag, it also outputs the following (a lot of times):

2021-06-10 09:22:33,760 DEBUG trial_runner.py:621 -- Running trial APEX_PFCAsset_985a1_00000
2021-06-10 09:22:33,760 DEBUG trial_executor.py:43 -- Trial APEX_PFCAsset_985a1_00000: Status PENDING unchanged.
2021-06-10 09:22:33,761 DEBUG trial_executor.py:62 -- Trial APEX_PFCAsset_985a1_00000: Saving trial metadata.

Downgrading to 1.2.0 solves the problem. I'm using Linux and Windows. I also tried the latest 2.0.0 dev wheel from the website and version 1.4.0 and get the same issue. I also experienced it with A3C, and another user on Slack reported experiencing it with PPO as well; according to him, the problem could lie in the resource allocation.

richardliaw commented 3 years ago

@floepfl can you try Ray==1.4?

sven1977 commented 3 years ago

@floepfl, could you also try lowering your num_workers by 1 or 2? It's probably because, via the placement groups, we now dedicate a CPU to the learner (local worker), which we didn't do in <= 1.2. We also now make sure that each replay buffer shard has its own CPU (something we didn't guarantee prior to 1.3). As a result, the number of CPUs on your machine may no longer be sufficient, so old configs that used to run fine now stay pending.

sven1977 commented 3 years ago

You can print out the needed resources by doing:

print(APEXTrainer.default_resource_request(config=config)._bundles)
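For example, a minimal runnable sketch (assuming the Ray 1.x import path, where the Ape-X trainer class and its default config live in ray.rllib.agents.dqn.apex; the num_workers value is just illustrative):

from ray.rllib.agents.dqn.apex import ApexTrainer, APEX_DEFAULT_CONFIG

config = APEX_DEFAULT_CONFIG.copy()
config["num_workers"] = 4  # illustrative; use whatever your experiment sets

# On Ray >= 1.3 this returns a PlacementGroupFactory; _bundles lists the
# CPU/GPU bundles the trial must reserve before it can leave PENDING.
print(ApexTrainer.default_resource_request(config=config)._bundles)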
sven1977 commented 3 years ago

The default behavior is this:

[
    {'CPU': 5, 'GPU': 1},  # <- learner: 1 CPU + 1 GPU for the learner, plus 4 CPUs for the replay shards (see config.optimizer.num_replay_shards)
    # 32 rollout workers, 1 CPU each:
    {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0},
    {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0},
    {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0},
    {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0},
    {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0},
    {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0},
    {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0},
    {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0},
]
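To make this fit on a small machine (e.g. the 4-CPU setup in the status output above), you would have to shrink both the worker count and the replay shard count. A rough sketch, assuming the Ape-X config key for the shard count is "num_replay_buffer_shards" under "optimizer" (values are illustrative, not recommendations):

config = {
    "num_workers": 2,   # 2 rollout workers -> 2 CPUs
    "num_gpus": 0,      # don't reserve a GPU for the learner
    "optimizer": {
        "num_replay_buffer_shards": 1,  # 1 replay shard -> 1 CPU
    },
}
# Requested total: 1 (learner) + 1 (replay shard) + 2 (workers) = 4 CPUs.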
floepfl commented 3 years ago

Thanks for your answers. Indeed, taking the replay buffer shards into account in the number of CPUs I made available in ray.init() resolved the issue.
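For anyone else hitting this, a minimal sketch of that fix (the worker count here is illustrative; 4 replay shards is the Ape-X default per the bundle printout above):

import ray

num_workers = 2
num_replay_shards = 4
# 1 CPU for the learner + 1 per replay buffer shard + 1 per rollout worker
ray.init(num_cpus=1 + num_replay_shards + num_workers)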

mickelliu commented 3 years ago

Hey everyone,

I encountered a similar issue today, but with QMIX. I have tried Ray 2.0.0 dev0, 1.5.0, and 1.4.0 with no success; they were all stuck at the "PENDING" status, no matter how I adjusted the number of workers and CPUs. I ran the trial overnight and it is still stuck at pending.

[screenshots: Capture, Capture2]

Then I saw this issue thread and lowered my Ray version to 1.2.0; however, a new error popped up:

2021-08-04 13:21:32,756 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "/data/[USER]/code/mate-nips/rllib_train.py", line 87, in <module>
    verbose=3)
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 404, in step
    self.trial_executor.on_no_available_trials(self)
  File "/home/[USER]/anaconda3/envs/nips-mate/lib/python3.7/site-packages/ray/tune/trial_executor.py", line 186, in on_no_available_trials
    "Insufficient cluster resources to launch trial: "
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 17.0 CPUs, 0.0 GPUs, but the cluster has only 16 CPUs, 0 GPUs, 77.78 GiB heap, 25.73 GiB objects (1.0 node:162.105.162.205, 1.0 accelerator_type:X). 

You can adjust the resource requests of RLlib agents by setting num_workers, num_gpus, and other configs. See the DEFAULT_CONFIG defined by each agent for more info.

The config of this agent is: {'env': 'mate_centralized', 'log_level': 'WARN', 'num_workers': 16, 'num_cpus_per_worker': 1.0, 'num_gpus': 0, 'num_gpus_per_worker': 0.0, 'framework': 'torch', 'exploration_config': {'type': 'EpsilonGreedy', 'initial_epsilon': 1.0, 'final_epsilon': 0.02, 'epsilon_timesteps': 10000}, 'horizon': None, 'callbacks': <class 'module.AuxiliaryRewardCallbacks'>} 

Interestingly, the trial always requests 1 CPU more than the number of CPUs I initialized Ray with: if I set the resources to 4 CPUs, it asks for 5; if I set them to 1 CPU, it asks for 2. Though I only saw this error once I downgraded Ray all the way to 1.2.0; higher versions didn't even raise this error.

I haven't tested whether this is a problem specific to the QMIX algorithm. I'm using a custom Gym environment, Python 3.7, PyTorch 1.4, and Gym 0.18.3.

mickelliu commented 3 years ago


Later I figured out it was a genuine mistake on my part. I thought I could fractionally divide the number of CPUs or GPUs by the number of workers:

        "num_cpus_per_worker": args.num_cpus / args.num_workers,
        "num_gpus_per_worker": args.num_gpus / args.num_workers,

I am not sure whether this is an error caused by the division, because both true division (/) and integer division (//) cause the same error. The config actually shows the correct number (num_cpus_per_worker = 1), but Ray automatically rounds it up to 2 per worker, which causes the 6-CPU requirement shown in the screenshot. Other than that I don't have a clear explanation. After deleting those two lines from my config file, everything works just fine.

xwjiang2010 commented 3 years ago

Hi @mickelliu, is this still an issue for you? If it is, can you provide a minimal set up for me to try? Thanks!

mickelliu commented 3 years ago

Hi @mickelliu, is this still an issue for you? If it is, can you provide a minimal set up for me to try? Thanks!

Hi @xwjiang2010, although I resolved this issue by setting up my config correctly, I do believe the underlying problem (getting stuck at pending) still persists in Ray versions > 1.2. To reproduce the error I got before, you could downgrade your Ray version to 1.2 and then add these lines to the RLlib algorithm config dict (which gets passed into tune.run()). I ran my code with a closed-source custom Gym environment that was registered in ray.tune.

config = {
        'num_workers': args.num_workers,
        "num_gpus": args.num_gpus,
        # The below two lines caused bugs
        # "num_cpus_per_worker": args.num_cpus // args.num_workers,
        # "num_gpus_per_worker": args.num_gpus // args.num_workers,
}

In my case, the divisions don't seem to cause the error but specifying the parameters "num_cpus_per_worker" and "num_gpus_per_worker" does.
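On Ray >= 1.3 you can also double-check what the trial will actually request by printing the resource bundles the way sven1977 showed above. A sketch, assuming the Ray 1.x QMIX import path (class name QMixTrainer; the config values are illustrative):

from ray.rllib.agents.qmix import QMixTrainer

config = {
    "num_workers": 4,
    # the two per-worker overrides that triggered the issue for me:
    # "num_cpus_per_worker": 1.0,
    # "num_gpus_per_worker": 0.0,
}
print(QMixTrainer.default_resource_request(config=config)._bundles)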

jamesliu commented 3 years ago


Yes. After removing num_gpus_per_worker, the pending issue is fixed.

JulesVerny commented 3 years ago

I have a similar issue; even with a basic Policy Gradient configuration, RLlib fails to start the job.

from ray.rllib.agents.pg.pg import (DEFAULT_CONFIG, PGTrainer as trainer)

Even though I have 12 CPU cores, I have tried setting:

config_update = {
    "env": args.env,
    "num_gpus": 1,
    "num_workers": 10,
    "evaluation_num_workers": 4,
    "evaluation_interval": 1,
}

Yet the status output still shows:

Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/31.34 GiB heap, 0.0/15.67 GiB objects

Still no joy, stuck at Pending!

richardliaw commented 3 years ago

Hey Jules, what version are you using?


JulesVerny commented 3 years ago

Hello, I was using Ray version 1.6.0 on Microsoft Windows 10 with Anaconda Python 3.7.1 when I had this problem. But I cannot remember in which order I ran pip install ray[rllib].

I actually got a PPO Ray job working with "framework": "torch" under another Anaconda Python 3.8.1 conda environment.
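In hindsight, the stuck run above is probably just the resource accounting sven1977 described earlier in this thread, assuming the driver/learner and each evaluation worker also reserve one CPU (a back-of-the-envelope guess, not an authoritative formula):

# 1 learner/driver + 10 rollout workers + 4 evaluation workers
needed_cpus = 1 + 10 + 4
print(needed_cpus)  # 15, which exceeds the 12 available CPUs, so the trial stays PENDING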

cheadrian commented 1 year ago

I think there should be an argument to set evaluation_num_workers in ScalingConfig and count it in the total number of workers, for clarity, or else the total number of workers should be limited to the value set in ScalingConfig; as it stands, if additional workers are set in config={}, the total will exceed that value.

gurdipk commented 6 months ago

You can print out the needed resources by doing:

print(APEXTrainer.default_resource_request(config=config)._bundles)

Do we add this print command within the function 'main'?