ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] Can not allocate GPU resource #38060

Closed · R0B1NNN1 closed this issue 1 year ago

R0B1NNN1 commented 1 year ago

What happened + What you expected to happen

Hi,

I am trying to use the GPU to accelerate the training process. Here are the relevant parts of my code.

ray.init(
  num_cpus=16,  
  num_gpus=1, 
  include_dashboard=False,
  ignore_reinit_error=True,
  log_to_driver=False,
)

and

config = (
    PPOConfig()
    .environment(env="mobile-medium-ma-v0")
    .framework("torch")
    .resources(num_cpus_per_worker=1, num_gpus_per_worker=1/16)
    .rollouts(num_rollout_workers=15)
)

It gives me the following error. I am wondering why. Thanks in advance for any help.

2023-08-03 14:51:47,125 ERROR tune_controller.py:873 -- Trial task failed for trial PPO_mobile-medium-ma-v0_cd219_00000
Traceback (most recent call last):
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\air\execution\_internal\event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\_private\auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\_private\worker.py", line 2542, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=30316, ip=127.0.0.1, actor_id=c243566c84a3d139d0151cba01000000, repr=PPO)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 242, in _setup
    self.add_workers(
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 635, in add_workers
    raise result.get()
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\rllib\utils\actor_manager.py", line 488, in __fetch_result
    result = ray.get(r)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\_private\auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\_private\worker.py", line 2542, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=30360, ip=127.0.0.1, actor_id=440aa6f3c23f53fbb9850e2601000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x000001FF094645B0>)
  File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1378, in ray._raylet.execute_task.function_executor
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\_private\function_manager.py", line 724, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 715, in __init__
    raise RuntimeError(
RuntimeError: Found 0 GPUs on your machine (GPU devices found: [])! If your
    machine does not have any GPUs, you should set the config keys `num_gpus` and
    `num_gpus_per_worker` to 0 (they may be set to 1 by default for your
    particular RL algorithm).
To change the config for the `rllib train|rollout` command, use
  `--config={'[key]': '[value]'}` on the command line.
To change the config for `tune.Tuner().fit()` in a script: Modify the python dict
  passed to `tune.Tuner(param_space=[...]).fit()`.
To change the config for an RLlib Algorithm instance: Modify the python dict
  passed to the Algorithm's constructor, e.g. `PPO(config=[...])`.

During handling of the above exception, another exception occurred:

ray::PPO.__init__() (pid=30316, ip=127.0.0.1, actor_id=c243566c84a3d139d0151cba01000000, repr=PPO)
  File "python\ray\_raylet.pyx", line 1431, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1510, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1378, in ray._raylet.execute_task.function_executor
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\_private\function_manager.py", line 724, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 475, in __init__
    super().__init__(
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\tune\trainable\trainable.py", line 170, in __init__
    self.setup(copy.deepcopy(self.config))
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 601, in setup
    self.workers = WorkerSet(
  File "c:\Users\18406\anaconda3\envs\rayenvtest\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 194, in __init__
    raise e.args[0].args[2]
RuntimeError: Found 0 GPUs on your machine (GPU devices found: [])! If your
    machine does not have any GPUs, you should set the config keys `num_gpus` and
    `num_gpus_per_worker` to 0 (they may be set to 1 by default for your
    particular RL algorithm).
To change the config for the `rllib train|rollout` command, use
  `--config={'[key]': '[value]'}` on the command line.
To change the config for `tune.Tuner().fit()` in a script: Modify the python dict
  passed to `tune.Tuner(param_space=[...]).fit()`.
To change the config for an RLlib Algorithm instance: Modify the python dict
  passed to the Algorithm's constructor, e.g. `PPO(config=[...])`.
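
For reference, the keys the error mentions correspond to the resources() call in the config builder I am using; a sketch of setting them explicitly would look like this:

# Sketch only: num_gpus is the GPU share for the trainer/learner process,
# num_gpus_per_worker the share for each rollout worker.
config = (
    PPOConfig()
    .resources(num_gpus=1, num_gpus_per_worker=1 / 16)
)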

Versions / Dependencies

Ray = 2.5.1, Python = 3.9.17

Reproduction script

ray.init(
  num_cpus=16,  
  num_gpus=1, 
  include_dashboard=False,
  ignore_reinit_error=True,
  log_to_driver=False,
)

and

config = (
    PPOConfig()
    .environment(env="mobile-medium-ma-v0")
    .framework("torch")
    .resources(num_cpus_per_worker=1, num_gpus_per_worker=1/16)
    .rollouts(num_rollout_workers=15)
)

Issue Severity

None

ArturNiederfahrenhorst commented 1 year ago

Do you have a GPU? What happens if you don't call ray.init() (this is done by RLlib under the hood already)?
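
Something like this should be enough to test it (a minimal sketch based on your snippets above, assuming the environment is registered as in your script):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="mobile-medium-ma-v0")
    .framework("torch")
    .resources(num_cpus_per_worker=1, num_gpus_per_worker=1 / 16)
    .rollouts(num_rollout_workers=15)
)

# No ray.init() call: building the algorithm starts Ray itself and
# auto-detects the available CPUs and GPUs.
algo = config.build()
print(algo.train())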

R0B1NNN1 commented 1 year ago

@ArturNiederfahrenhorst Hi, thanks for replying.

Yes, I do have a GPU. I used:

import GPUtil

gpus = GPUtil.getGPUs()
print("Num GPUs Available:", len(gpus))

and it shows me that I have one GPU. I also tried to run it without ray.init(), and it still shows the same problem. Could it be a problem with CUDA?
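
Would it also help to check what PyTorch itself sees? For example (a quick sketch):

import torch

# If these print False / 0 while nvidia-smi shows the GPU, the installed
# PyTorch build is probably CPU-only or built against a mismatched CUDA.
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())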

Thanks.

ArturNiederfahrenhorst commented 1 year ago

Please post your entire reproduction script. Also, please run nvidia-smi and check that its output matches your expectations.

R0B1NNN1 commented 1 year ago

import gymnasium
from ray.tune.registry import register_env

# use the mobile-env RLlib wrapper for RLlib
def register(config):
    # importing mobile_env registers the included environments
    import mobile_env
    from mobile_env.wrappers.multi_agent import RLlibMAWrapper

    env = gymnasium.make("mobile-medium-ma-v0")
    return RLlibMAWrapper(env)

# register the predefined scenario with RLlib
register_env("mobile-medium-ma-v0", register)

import ray

# init Ray with the available CPUs (and GPUs)
ray.init(
  num_cpus=12,   # change to your available number of CPUs
  num_gpus=1,
  include_dashboard=False,
  ignore_reinit_error=True,
  log_to_driver=False,
)

import ray.air
from ray.rllib.algorithms.ppo import PPOConfig

from ray.rllib.policy.policy import PolicySpec
from ray.tune.stopper import MaximumIterationStopper

# Create an RLlib config using multi-agent PPO on mobile-env's small scenario.
config = (
    PPOConfig()
    .environment(env="mobile-medium-ma-v0")
    .framework("torch")
    # Here, we configure all agents to share the same policy.
    .multi_agent(
        policies={
            'policy_agent0': PolicySpec(),
            'policy_agent1': PolicySpec(),
            'policy_agent2': PolicySpec(),
            'policy_agent3': PolicySpec(),
            'policy_agent4': PolicySpec(),
            'policy_agent5': PolicySpec(),
            'policy_agent6': PolicySpec(),
            'policy_agent7': PolicySpec(),
            'policy_agent8': PolicySpec(),
            'policy_agent9': PolicySpec(),
            'policy_agent10': PolicySpec(),
            'policy_agent11': PolicySpec(),
            'policy_agent12': PolicySpec(),
            'policy_agent13': PolicySpec(),
            'policy_agent14': PolicySpec()
        },
        policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: 'policy_agent' + str(agent_id),
    )
    # RLlib needs 1 more CPU than configured below (for the driver/trainer?)
    .resources(num_cpus_per_worker=1, num_gpus_per_worker=1/16)
    .rollouts(num_rollout_workers=15)
)

# Create the Trainer/Tuner and define how long to train
tuner = ray.tune.Tuner(
    "PPO",
    run_config=ray.air.RunConfig(
        # Save the training progress and checkpoints locally under the specified subfolder.
        storage_path="./DTDE_medium_tests",
        # Control training length by setting the number of iterations. 1 iter = 4000 time steps by default.
        stop=MaximumIterationStopper(max_iter=10),
        checkpoint_config=ray.air.CheckpointConfig(checkpoint_at_end=True),
    ),
    param_space=config,
)

# Run training and save the result
result_grid = tuner.fit()

And when I run `nvidia-smi`:

Fri Aug  4 17:51:52 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 ...  WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   57C    P8              18W / 130W |   1488MiB /  8192MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1100    C+G   ...__8wekyb3d8bbwe\WindowsTerminal.exe    N/A      |
|    0   N/A  N/A      3924    C+G   ...GeForce Experience\NVIDIA Share.exe    N/A      |
|    0   N/A  N/A      8408    C+G   ...les\Microsoft OneDrive\OneDrive.exe    N/A      |
|    0   N/A  N/A     10112    C+G   ...nr4m\radeonsoftware\AMDRSSrcExt.exe    N/A      |
|    0   N/A  N/A     10580    C+G   ...Programs\Microsoft VS Code\Code.exe    N/A      |
|    0   N/A  N/A     11540    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe    N/A      |
|    0   N/A  N/A     11852    C+G   ...n\115.0.1901.188\msedgewebview2.exe    N/A      |
|    0   N/A  N/A     16464    C+G   ...paper_engine\bin\webwallpaper32.exe    N/A      |
|    0   N/A  N/A     16960    C+G   ...l\Microsoft\Teams\current\Teams.exe    N/A      |
|    0   N/A  N/A     17288    C+G   ...US\ArmouryDevice\asus_framework.exe    N/A      |
|    0   N/A  N/A     18272    C+G   ...crosoft\Edge\Application\msedge.exe    N/A      |
|    0   N/A  N/A     18468    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe    N/A      |
|    0   N/A  N/A     18724    C+G   ...\cef\cef.win7x64\steamwebhelper.exe    N/A      |
|    0   N/A  N/A     19308    C+G   ...les\Microsoft OneDrive\OneDrive.exe    N/A      |
|    0   N/A  N/A     21196    C+G   ...l\Microsoft\Teams\current\Teams.exe    N/A      |
|    0   N/A  N/A     21588    C+G   ...7\extracted\runtime\WeChatAppEx.exe    N/A      |
|    0   N/A  N/A     22720    C+G   ...2txyewy\StartMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     23624    C+G   ...YNK4LKTQIF2SCK2UYOCE7A2AQ\DeepL.exe    N/A      |
|    0   N/A  N/A     23632    C+G   C:\Program Files\LGHUB\lghub.exe          N/A      |
|    0   N/A  N/A     23652    C+G   ...GeForce Experience\NVIDIA Share.exe    N/A      |
|    0   N/A  N/A     25552    C+G   ...siveControlPanel\SystemSettings.exe    N/A      |
|    0   N/A  N/A     28012    C+G   ...5n1h2txyewy\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A     28312    C+G   ...03.0_x64__8wekyb3d8bbwe\Cortana.exe    N/A      |
|    0   N/A  N/A     28776    C+G   ....5.49.0_x64__htrsf667h5kn2\AWCC.exe    N/A      |
|    0   N/A  N/A     29676    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe    N/A      |
|    0   N/A  N/A     30600    C+G   ...les\Microsoft OneDrive\OneDrive.exe    N/A      |
|    0   N/A  N/A     32664    C+G   C:\Windows\explorer.exe                   N/A      |
|    0   N/A  N/A     34220    C+G   ...on\wallpaper_engine\wallpaper32.exe    N/A      |
|    0   N/A  N/A     34268    C+G   ...m\radeonsoftware\RadeonSoftware.exe    N/A      |
|    0   N/A  N/A     34416    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A     35140    C+G   ...1.0_x64__8wekyb3d8bbwe\Video.UI.exe    N/A      |
|    0   N/A  N/A     35320    C+G   ...B\system_tray\lghub_system_tray.exe    N/A      |
|    0   N/A  N/A     35768    C+G   ...x64__qmba6cd70vzyy\ArmouryCrate.exe    N/A      |
+---------------------------------------------------------------------------------------+

May I ask whether this could be a problem with the CUDA version, or does Ray automatically detect the GPU? I remember I downgraded my CUDA version from 12.2 to 11.8 because I thought I needed the matching versions of CUDA and PyTorch, but I don't know why the terminal still shows CUDA Version: 12.2.
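
Regarding the downgrade, I can also check which CUDA build of PyTorch is actually installed, e.g. with this sketch (as far as I understand, nvidia-smi reports the CUDA version supported by the driver rather than the installed toolkit):

import torch

# CUDA version this PyTorch build was compiled against (None for a CPU-only build).
print("torch.version.cuda:", torch.version.cuda)
# Official pip wheels also encode the build, e.g. "2.0.1+cu118" vs "2.0.1+cpu".
print("torch.__version__:", torch.__version__)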

Thanks for replying in advance.

ArturNiederfahrenhorst commented 1 year ago

> May I ask whether this could be a problem with the CUDA version, or does Ray automatically detect the GPU? I remember I downgraded my CUDA version from 12.2 to 11.8 because I thought I needed the matching versions of CUDA and PyTorch, but I don't know why the terminal still shows CUDA Version: 12.2.

No, Ray/RLlib does not care about the CUDA version.

I've copied your script, excluding the environment (which I don't have) and modifying the policy specs, which should not have any influence on resource allocation, to get the following:

import gymnasium
from ray.tune.registry import register_env

import ray

import ray.air
from ray.rllib.algorithms.ppo import PPOConfig

from ray.rllib.policy.policy import PolicySpec
from ray.tune.stopper import MaximumIterationStopper

# Create an RLlib config using multi-agent PPO on mobile-env's small scenario.
config = (
    PPOConfig()
    .environment(env="CartPole-v1")
    .framework("torch")
    # RLlib needs 1 more CPU than configured below (for the driver/trainer?)
    .resources(num_cpus_per_worker=1, num_gpus_per_worker=1/16)
    .rollouts(num_rollout_workers=15)
)

# Create the Trainer/Tuner and define how long to train
tuner = ray.tune.Tuner(
    "PPO",
    run_config=ray.air.RunConfig(
        # Save the training progress and checkpoints locally under the specified subfolder.
        storage_path="./DTDE_medium_tests",
        # Control training length by setting the number of iterations. 1 iter = 4000 time steps by default.
        stop=MaximumIterationStopper(max_iter=10),
        checkpoint_config=ray.air.CheckpointConfig(checkpoint_at_end=True),
    ),
    param_space=config,
)

# Run training and save the result
result_grid = tuner.fit()

This script runs without any issues if you remove ray.init(), which I suggested in my first answer. I'm afraid I cannot help here if I cannot reproduce the issue with the above script.
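
If it still fails on your machine, it might also help to check what Ray itself detects outside of RLlib, along these lines (a minimal sketch using standard Ray APIs):

import ray
import torch

ray.init(ignore_reinit_error=True)

# Resources Ray auto-detected on this machine; there should be a "GPU" entry.
print(ray.cluster_resources())

@ray.remote(num_gpus=1 / 16)
def check_gpu():
    # Ray restricts each worker to its assigned GPUs via CUDA_VISIBLE_DEVICES.
    return ray.get_gpu_ids(), torch.cuda.is_available()

print(ray.get(check_gpu.remote()))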

R0B1NNN1 commented 1 year ago

Hi, I ran the code you posted above and it shows me the same problem. But never mind, I am trying to work it out. It is quite strange since this works in Colab but does not work on my own laptop.

ArturNiederfahrenhorst commented 1 year ago

Thanks. I'm going to close this issue for now! Please let us know if/how you resolve this here 🙂

luzgui commented 2 months ago

Hello, is there any development on this issue? I am having the same problem. Everything is installed, and Ray, nvidia-smi, and TensorFlow all recognize the GPU, but that error still occurs.