Closed R0B1NNN1 closed 1 year ago
Do you have a GPU? What happens if you don't call ray.init() (this is done by RLlib under the hood already)?
@ArturNiederfahrenhorst Hi, thanks for replying.
Yes, I do have a GPU. I used:
import GPUtil
gpus = GPUtil.getGPUs()
print("Num GPUs Available:", len(gpus))
and it shows me I have one GPU. I also tried to run it without ray.init(), and it still shows me the same problem. Could it be some problem with CUDA?
Thanks.
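(Aside, not from the original thread: since the config below uses framework("torch"), a complementary check is whether PyTorch itself can see the GPU, which is what RLlib's torch policies ultimately rely on. A minimal sketch, with the torch import guarded in case it is not installed:)

```python
# Hypothetical extra check: probe PyTorch's view of the GPU without
# assuming torch is installed in the current environment.
import importlib.util

if importlib.util.find_spec("torch") is not None:
    import torch
    msg = f"torch sees CUDA: {torch.cuda.is_available()}"
else:
    msg = "torch is not installed in this environment"
print(msg)
```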
Please post your entire reproduction script.
Also, please run nvidia-smi
and see if it runs to your expectations.
import gymnasium
from ray.tune.registry import register_env

# use the mobile-env RLlib wrapper for RLlib
def register(config):
    # importing mobile_env registers the included environments
    import mobile_env
    from mobile_env.wrappers.multi_agent import RLlibMAWrapper

    env = gymnasium.make("mobile-medium-ma-v0")
    return RLlibMAWrapper(env)

# register the predefined scenario with RLlib
register_env("mobile-medium-ma-v0", register)

import ray

# init ray with the available CPUs (and GPUs)
ray.init(
    num_cpus=12,  # change to your available number of CPUs
    num_gpus=1,
    include_dashboard=False,
    ignore_reinit_error=True,
    log_to_driver=False,
)
import ray.air
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.policy import PolicySpec
from ray.tune.stopper import MaximumIterationStopper

# Create an RLlib config using multi-agent PPO on mobile-env's medium scenario.
config = (
    PPOConfig()
    .environment(env="mobile-medium-ma-v0")
    .framework("torch")
    # Here, we configure a separate policy for each agent.
    .multi_agent(
        policies={
            'policy_agent0': PolicySpec(),
            'policy_agent1': PolicySpec(),
            'policy_agent2': PolicySpec(),
            'policy_agent3': PolicySpec(),
            'policy_agent4': PolicySpec(),
            'policy_agent5': PolicySpec(),
            'policy_agent6': PolicySpec(),
            'policy_agent7': PolicySpec(),
            'policy_agent8': PolicySpec(),
            'policy_agent9': PolicySpec(),
            'policy_agent10': PolicySpec(),
            'policy_agent11': PolicySpec(),
            'policy_agent12': PolicySpec(),
            'policy_agent13': PolicySpec(),
            'policy_agent14': PolicySpec(),
        },
        policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: 'policy_agent' + str(agent_id),
    )
    # RLlib needs one more CPU than configured below (for the driver/trainer).
    .resources(num_cpus_per_worker=1, num_gpus_per_worker=1 / 16)
    .rollouts(num_rollout_workers=15)
)
# Create the Trainer/Tuner and define how long to train
tuner = ray.tune.Tuner(
    "PPO",
    run_config=ray.air.RunConfig(
        # Save the training progress and checkpoints locally under the specified subfolder.
        storage_path="./DTDE_medium_tests",
        # Control training length by setting the number of iterations. 1 iter = 4000 time steps by default.
        stop=MaximumIterationStopper(max_iter=10),
        checkpoint_config=ray.air.CheckpointConfig(checkpoint_at_end=True),
    ),
    param_space=config,
)

# Run training and save the result
result_grid = tuner.fit()
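(Aside, not from the original thread: a quick back-of-the-envelope tally of what this config requests can be illustrative. Note that 15 rollout workers at 1 CPU each plus the driver exceeds the num_cpus=12 cap given to ray.init(). PolicySpec is stubbed below so the sketch runs without Ray installed; it also shows how the fifteen hand-written policy entries could be generated with a comprehension.)

```python
# Back-of-the-envelope resource tally for the config above, assuming
# RLlib's usual accounting of one extra CPU for the driver.
num_workers = 15
cpus_requested = num_workers * 1 + 1       # workers + driver = 16 > num_cpus=12
worker_gpu_share = num_workers * (1 / 16)  # 0.9375 of the single GPU

# Stand-in for ray.rllib.policy.policy.PolicySpec, so this runs anywhere.
class PolicySpec:
    pass

# Equivalent to the fifteen hand-written entries in the config.
policies = {f"policy_agent{i}": PolicySpec() for i in range(15)}

print(cpus_requested)    # 16
print(worker_gpu_share)  # 0.9375
print(len(policies))     # 15
```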
And when I run nvidia-smi:
Fri Aug 4 17:51:52 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67 Driver Version: 536.67 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 ... WDDM | 00000000:01:00.0 On | N/A |
| N/A 57C P8 18W / 130W | 1488MiB / 8192MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1100 C+G ...__8wekyb3d8bbwe\WindowsTerminal.exe N/A |
| 0 N/A N/A 3924 C+G ...GeForce Experience\NVIDIA Share.exe N/A |
| 0 N/A N/A 8408 C+G ...les\Microsoft OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 10112 C+G ...nr4m\radeonsoftware\AMDRSSrcExt.exe N/A |
| 0 N/A N/A 10580 C+G ...Programs\Microsoft VS Code\Code.exe N/A |
| 0 N/A N/A 11540 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 11852 C+G ...n\115.0.1901.188\msedgewebview2.exe N/A |
| 0 N/A N/A 16464 C+G ...paper_engine\bin\webwallpaper32.exe N/A |
| 0 N/A N/A 16960 C+G ...l\Microsoft\Teams\current\Teams.exe N/A |
| 0 N/A N/A 17288 C+G ...US\ArmouryDevice\asus_framework.exe N/A |
| 0 N/A N/A 18272 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 18468 C+G ...nt.CBS_cw5n1h2txyewy\SearchHost.exe N/A |
| 0 N/A N/A 18724 C+G ...\cef\cef.win7x64\steamwebhelper.exe N/A |
| 0 N/A N/A 19308 C+G ...les\Microsoft OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 21196 C+G ...l\Microsoft\Teams\current\Teams.exe N/A |
| 0 N/A N/A 21588 C+G ...7\extracted\runtime\WeChatAppEx.exe N/A |
| 0 N/A N/A 22720 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |
| 0 N/A N/A 23624 C+G ...YNK4LKTQIF2SCK2UYOCE7A2AQ\DeepL.exe N/A |
| 0 N/A N/A 23632 C+G C:\Program Files\LGHUB\lghub.exe N/A |
| 0 N/A N/A 23652 C+G ...GeForce Experience\NVIDIA Share.exe N/A |
| 0 N/A N/A 25552 C+G ...siveControlPanel\SystemSettings.exe N/A |
| 0 N/A N/A 28012 C+G ...5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 N/A N/A 28312 C+G ...03.0_x64__8wekyb3d8bbwe\Cortana.exe N/A |
| 0 N/A N/A 28776 C+G ....5.49.0_x64__htrsf667h5kn2\AWCC.exe N/A |
| 0 N/A N/A 29676 C+G ...t.LockApp_cw5n1h2txyewy\LockApp.exe N/A |
| 0 N/A N/A 30600 C+G ...les\Microsoft OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 32664 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 34220 C+G ...on\wallpaper_engine\wallpaper32.exe N/A |
| 0 N/A N/A 34268 C+G ...m\radeonsoftware\RadeonSoftware.exe N/A |
| 0 N/A N/A 34416 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 35140 C+G ...1.0_x64__8wekyb3d8bbwe\Video.UI.exe N/A |
| 0 N/A N/A 35320 C+G ...B\system_tray\lghub_system_tray.exe N/A |
| 0 N/A N/A 35768 C+G ...x64__qmba6cd70vzyy\ArmouryCrate.exe N/A |
+---------------------------------------------------------------------------------------+
May I ask whether this could be a problem with the CUDA version, or does Ray automatically detect the GPU? I remember I downgraded my CUDA version from 12.2 to 11.8, because I thought I needed matching versions of CUDA and PyTorch. But I don't know why the terminal still shows CUDA Version: 12.2.
Thanks for replying in advance.
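(A hedged note on the version question, not from the original thread: nvidia-smi reports the newest CUDA version the installed *driver* supports, while nvcc --version reports the installed *toolkit*, and torch.version.cuda reports what PyTorch was built against. These can legitimately differ, which would explain still seeing 12.2 after installing the 11.8 toolkit. A small probe, assuming nothing about which tools are on PATH:)

```python
# Check which CUDA-related tools are available; their own output shows
# the driver-supported vs. installed toolkit versions respectively.
import shutil
import subprocess

report = []
for tool, args in [("nvidia-smi", []), ("nvcc", ["--version"])]:
    if shutil.which(tool) is None:
        report.append(f"{tool}: not found on PATH")
    else:
        subprocess.run([tool, *args], capture_output=True, text=True)
        report.append(f"{tool}: present (run it to see its version output)")
print("\n".join(report))
```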
No, Ray/RLlib does not care about the CUDA version.
I've copied your script, excluding the environment (which I don't have) and modifying the policy specs (which should not have any influence on resource allocation), to get the following:
import gymnasium
from ray.tune.registry import register_env
import ray
import ray.air
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.policy import PolicySpec
from ray.tune.stopper import MaximumIterationStopper
# Create an RLlib PPO config (environment swapped for CartPole-v1, multi-agent parts removed).
config = (
    PPOConfig()
    .environment(env="CartPole-v1")
    .framework("torch")
    # RLlib needs one more CPU than configured below (for the driver/trainer).
    .resources(num_cpus_per_worker=1, num_gpus_per_worker=1 / 16)
    .rollouts(num_rollout_workers=15)
)
# Create the Trainer/Tuner and define how long to train
tuner = ray.tune.Tuner(
    "PPO",
    run_config=ray.air.RunConfig(
        # Save the training progress and checkpoints locally under the specified subfolder.
        storage_path="./DTDE_medium_tests",
        # Control training length by setting the number of iterations. 1 iter = 4000 time steps by default.
        stop=MaximumIterationStopper(max_iter=10),
        checkpoint_config=ray.air.CheckpointConfig(checkpoint_at_end=True),
    ),
    param_space=config,
)

# Run training and save the result
result_grid = tuner.fit()
This script runs without any issues if you remove ray.init(), which I suggested in my first answer. I'm afraid I cannot help here if I cannot reproduce the issue with the above script.
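(Aside, not from the original thread: both scripts stop training via MaximumIterationStopper. A simplified stand-in, not Ray's actual code, for what that stopper does: end a trial once its reported training_iteration reaches max_iter.)

```python
# Simplified stub mirroring ray.tune.stopper.MaximumIterationStopper's
# behavior: a callable that Tune invokes with each trial's latest result.
class MaxIterStopper:
    def __init__(self, max_iter):
        self.max_iter = max_iter

    def __call__(self, trial_id, result):
        # Stop this trial once it has run max_iter training iterations.
        return result["training_iteration"] >= self.max_iter

stop = MaxIterStopper(max_iter=10)
print(stop("t1", {"training_iteration": 9}))   # False
print(stop("t1", {"training_iteration": 10}))  # True
```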
Hi, I ran the code you posted above and it shows me the same problem. But never mind, I am trying to work it out. It is quite strange, since this works in Colab but does not work on my own laptop.
Thanks. I'm going to close this issue for now! Please let us know if/how you resolve this here 🙂
Hello, any developments on this issue? I am having the same problem. Everything is installed, and Ray, nvidia-smi, and TF all recognize the GPU, but the error still occurs.
What happened + What you expected to happen
Hi,
I am trying to use the GPU to accelerate the training process. Here are some parts of my code:
and
And it shows me the following error. I am wondering why? Thanks for replying in advance.
Versions / Dependencies
Ray = 2.5.1, Python = 3.9.17
Reproduction script
and
Issue Severity
None