ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[RLlib] New API Stack: "local_gpu_idx 0 is not a valid GPU id or is not available." #47364

Open PhilippWillms opened 2 months ago

PhilippWillms commented 2 months ago

EDIT: Issue identified in the ray 2.34 release; fixed a linting issue in the repro script.

What happened + What you expected to happen

I configured a Windows anaconda environment with set CUDA_VISIBLE_DEVICES='1', as I have only one physical GPU. Running the script below then produces the following error stack trace:

"name": "AssertionError", "message": "local_gpu_idx 0 is not a valid GPU id or is not available.", "stack": "--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) Cell In[8], line 4 1 checkpoint_at_every_iter = 5 2 iteration_count = 20 ----> 4 trainer = config.build() 6 results = [] 7 for i in tqdm(range(1,iteration_count+1)):

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\algorithm_config.py:882, in AlgorithmConfig.build(self, env, logger_creator, use_copy) 879 if isinstance(self.algo_class, str): 880 algo_class = get_trainable_cls(self.algo_class) --> 882 return algo_class( 883 config=self if not use_copy else copy.deepcopy(self), 884 logger_creator=self.logger_creator, 885 )

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\algorithm.py:571, in Algorithm.init(self, config, env, logger_creator, kwargs) 568 # Evaluation EnvRunnerGroup and metrics last returned by self.evaluate(). 569 self.eval_env_runner_group: Optional[EnvRunnerGroup] = None --> 571 super().init( 572 config=config, 573 logger_creator=logger_creator, 574 kwargs, 575 )

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\tune\trainable\trainable.py:158, in Trainable.init(self, config, logger_creator, storage) 154 logger.debug(f\"StorageContext on the TRAINABLE:\ {storage}\") 156 self._open_logfiles(stdout_file, stderr_file) --> 158 self.setup(copy.deepcopy(self.config)) 159 setup_time = time.time() - self._start_time 160 if setup_time > SETUP_TIME_THRESHOLD:

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\algorithm.py:801, in Algorithm.setup(self, config) 796 else: 797 raise AttributeError( 798 \"Your local EnvRunner/RolloutWorker does NOT have any property \" 799 \"referring to its RLModule!\" 800 ) --> 801 self.learner_group = self.config.build_learner_group( 802 rl_module_spec=module_spec 803 ) 805 # Check if there are modules to load from the module_spec. 806 rl_module_ckpt_dirs = {}

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\algorithm_config.py:1143, in AlgorithmConfig.build_learner_group(self, env, spaces, rl_module_spec) 1140 rl_module_spec = self.get_multi_rl_module_spec(env=env, spaces=spaces) 1142 # Construct the actual LearnerGroup. -> 1143 learner_group = LearnerGroup(config=self.copy(), module_spec=rl_module_spec) 1145 return learner_group

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\core\learner\learner_group.py:133, in LearnerGroup.init(self, config, module_spec) 131 if not self.is_remote: 132 self._learner = learner_class(config=config, module_spec=module_spec) --> 133 self._learner.build() 134 self._worker_manager = None 135 # N remote Learner workers. 136 else:

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\ppo\ppo_learner.py:36, in PPOLearner.build(self) 34 @override(Learner) 35 def build(self) -> None: ---> 36 super().build() 38 # Dict mapping module IDs to the respective entropy Scheduler instance. 39 self.entropy_coeff_schedulers_per_module: Dict[ 40 ModuleID, Scheduler 41 ] = LambdaDefaultDict( (...) 48 ) 49 )

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\core\learner\torch\torch_learner.py:308, in TorchLearner.build(self) 306 self._device = devices[0] 307 else: --> 308 assert self._local_gpu_idx < torch.cuda.device_count(), ( 309 f\"local_gpu_idx {self._local_gpu_idx} is not a valid GPU id or is \" 310 \" not available.\" 311 ) 312 # this is an index into the available cuda devices. For example if 313 # os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"1\" then 314 # torch.cuda.device_count() = 1 and torch.device(0) will actuall map to 315 # the gpu with id 1 on the node. 316 self._device = torch.device(self._local_gpu_idx)

AssertionError: local_gpu_idx 0 is not a valid GPU id or is not available."
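
For context, the failing check in TorchLearner.build() compares local_gpu_idx against torch.cuda.device_count(). A minimal sketch of why it trips in this setup (assuming a machine whose only GPU has device id 0; local_gpu_idx = 0 stands in for RLlib's default):

import os

# Setting CUDA_VISIBLE_DEVICES to "1" on a machine whose only GPU has id 0
# hides that GPU entirely; torch then reports zero visible devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch  # the env var must be set before CUDA is initialized

local_gpu_idx = 0                  # default index for the local Learner's GPU
print(torch.cuda.device_count())   # 0 in this setup
# This is the same condition TorchLearner.build() asserts, so it raises here:
assert local_gpu_idx < torch.cuda.device_count(), (
    f"local_gpu_idx {local_gpu_idx} is not a valid GPU id or is not available."
)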

Versions / Dependencies

ray==2.34 python==3.11.9

Reproduction script

# Imports for the repro script (module paths assumed from RLlib's bundled
# examples; they may differ slightly across Ray versions).
from gymnasium.spaces import Box, Discrete
from tqdm import tqdm

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.examples.envs.classes.action_mask_env import ActionMaskEnv
from ray.rllib.examples.rl_modules.classes.action_masking_rlm import (
    ActionMaskingTorchRLModule,
)

config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment(
        env=ActionMaskEnv,
        env_config={
            "action_space": Discrete(100),
            "observation_space": Box(-1.0, 1.0, (5,)),
        },
    )
    .framework("torch")
    .resources(num_gpus=1) 
    .learners(num_learners=0, num_gpus_per_learner=1)
    .env_runners(
        num_env_runners=4, 
        num_cpus_per_env_runner=1,
        batch_mode="complete_episodes",
    )
    .rl_module(
        model_config_dict={
            "post_fcnet_hiddens": [64, 64],
            "post_fcnet_activation": "relu",
        },
        rl_module_spec=RLModuleSpec(
            module_class=ActionMaskingTorchRLModule,
        ),
    )
    .evaluation(
        evaluation_num_env_runners=1,
        evaluation_interval=1,  
        evaluation_parallel_to_training=True,
    ) 
)

checkpoint_at_every_iter = 5
iteration_count = 20
trainer = config.build()
results = []
for i in tqdm(range(1,iteration_count+1)):
    res = trainer.train()
    if i % checkpoint_at_every_iter == 0:
        path_to_checkpoint = trainer.save()        
        print(f"Checkpoint saved at {path_to_checkpoint}")
    results.append(res)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

PhilippWillms commented 2 months ago

@simonsays1980, @sven1977: This issue only occurs when using config.build() and then calling train() directly; running the same config through Tune does not hit it.
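
For reference, a rough sketch of the Tune path that works here, reusing the config object from the repro script above (the exact RunConfig import location may vary by Ray version):

from ray import train, tune

# Same stopping point as the 20-iteration loop in the repro script.
tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=train.RunConfig(stop={"training_iteration": 20}),
)
results = tuner.fit()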

simonsays1980 commented 1 month ago

@PhilippWillms Thanks for raising this issue. It's hard to reproduce as we do not have this hardware setup.

PhilippWillms commented 1 month ago

@simonsays1980: What I learnt today based on your comment: the correct torch version is installed, torch.cuda.is_available() returns True, and torch.cuda.device_count() returns 1 .... but only as long as the environment variable CUDA_VISIBLE_DEVICES is NOT set in the anaconda environment. Leave it unset and torch detects the GPU correctly.
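
A quick way to verify this before calling config.build(), assuming the single GPU really has device id 0:

import os
import torch

# Either leave CUDA_VISIBLE_DEVICES unset, or point it at the existing
# device id "0"; setting it to "1" hides the only GPU on this machine.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # expect None or "0"
print(torch.cuda.is_available())               # expect True
print(torch.cuda.device_count())               # expect 1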