ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[RLlib] New API Stack: "local_gpu_idx 0 is not a valid GPU id or is not available." #47364

Open PhilippWillms opened 2 months ago

PhilippWillms commented 2 months ago

EDIT: Issue identified in the ray 2.34 release; fixed a linting issue in the repro script.

What happened + What you expected to happen

I configured a Windows anaconda environment with set CUDA_VISIBLE_DEVICES='1', as I have only one physical GPU. Running the script below then produces the following error stack trace:

"name": "AssertionError", "message": "local_gpu_idx 0 is not a valid GPU id or is not available.", "stack": "--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) Cell In[8], line 4 1 checkpoint_at_every_iter = 5 2 iteration_count = 20 ----> 4 trainer = config.build() 6 results = [] 7 for i in tqdm(range(1,iteration_count+1)):

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\algorithm_config.py:882, in AlgorithmConfig.build(self, env, logger_creator, use_copy) 879 if isinstance(self.algo_class, str): 880 algo_class = get_trainable_cls(self.algo_class) --> 882 return algo_class( 883 config=self if not use_copy else copy.deepcopy(self), 884 logger_creator=self.logger_creator, 885 )

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\algorithm.py:571, in Algorithm.init(self, config, env, logger_creator, kwargs) 568 # Evaluation EnvRunnerGroup and metrics last returned by self.evaluate(). 569 self.eval_env_runner_group: Optional[EnvRunnerGroup] = None --> 571 super().init( 572 config=config, 573 logger_creator=logger_creator, 574 kwargs, 575 )

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\tune\trainable\trainable.py:158, in Trainable.init(self, config, logger_creator, storage) 154 logger.debug(f\"StorageContext on the TRAINABLE:\ {storage}\") 156 self._open_logfiles(stdout_file, stderr_file) --> 158 self.setup(copy.deepcopy(self.config)) 159 setup_time = time.time() - self._start_time 160 if setup_time > SETUP_TIME_THRESHOLD:

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\algorithm.py:801, in Algorithm.setup(self, config) 796 else: 797 raise AttributeError( 798 \"Your local EnvRunner/RolloutWorker does NOT have any property \" 799 \"referring to its RLModule!\" 800 ) --> 801 self.learner_group = self.config.build_learner_group( 802 rl_module_spec=module_spec 803 ) 805 # Check if there are modules to load from the module_spec. 806 rl_module_ckpt_dirs = {}

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\algorithm_config.py:1143, in AlgorithmConfig.build_learner_group(self, env, spaces, rl_module_spec) 1140 rl_module_spec = self.get_multi_rl_module_spec(env=env, spaces=spaces) 1142 # Construct the actual LearnerGroup. -> 1143 learner_group = LearnerGroup(config=self.copy(), module_spec=rl_module_spec) 1145 return learner_group

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\core\learner\learner_group.py:133, in LearnerGroup.init(self, config, module_spec) 131 if not self.is_remote: 132 self._learner = learner_class(config=config, module_spec=module_spec) --> 133 self._learner.build() 134 self._worker_manager = None 135 # N remote Learner workers. 136 else:

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\algorithms\ppo\ppo_learner.py:36, in PPOLearner.build(self) 34 @override(Learner) 35 def build(self) -> None: ---> 36 super().build() 38 # Dict mapping module IDs to the respective entropy Scheduler instance. 39 self.entropy_coeff_schedulers_per_module: Dict[ 40 ModuleID, Scheduler 41 ] = LambdaDefaultDict( (...) 48 ) 49 )

File c:\Users\Philipp\anaconda3\envs\py311-raynew\Lib\site-packages\ray\rllib\core\learner\torch\torch_learner.py:308, in TorchLearner.build(self) 306 self._device = devices[0] 307 else: --> 308 assert self._local_gpu_idx < torch.cuda.device_count(), ( 309 f\"local_gpu_idx {self._local_gpu_idx} is not a valid GPU id or is \" 310 \" not available.\" 311 ) 312 # this is an index into the available cuda devices. For example if 313 # os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"1\" then 314 # torch.cuda.device_count() = 1 and torch.device(0) will actuall map to 315 # the gpu with id 1 on the node. 316 self._device = torch.device(self._local_gpu_idx)

AssertionError: local_gpu_idx 0 is not a valid GPU id or is not available."
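
For context, the failing check in TorchLearner.build() compares local_gpu_idx against torch.cuda.device_count(). A minimal sketch of why it trips in this setup (assuming a machine whose only GPU has device id 0; local_gpu_idx = 0 stands in for RLlib's default):

import os

# Setting CUDA_VISIBLE_DEVICES to "1" on a machine whose only GPU has id 0
# hides that GPU entirely; torch then reports zero visible devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch  # the env var must be set before CUDA is initialized

local_gpu_idx = 0                  # default index for the local Learner's GPU
print(torch.cuda.device_count())   # 0 in this setup
# This is the same condition TorchLearner.build() asserts, so it raises here:
assert local_gpu_idx < torch.cuda.device_count(), (
    f"local_gpu_idx {local_gpu_idx} is not a valid GPU id or is not available."
)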

Versions / Dependencies

ray==2.34 python==3.11.9

Reproduction script

# Imports for the repro script (module paths assumed from RLlib's bundled
# examples; they may differ slightly across Ray versions).
from gymnasium.spaces import Box, Discrete
from tqdm import tqdm

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec
from ray.rllib.examples.envs.classes.action_mask_env import ActionMaskEnv
from ray.rllib.examples.rl_modules.classes.action_masking_rlm import (
    ActionMaskingTorchRLModule,
)

config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment(
        env=ActionMaskEnv,
        env_config={
            "action_space": Discrete(100),
            "observation_space": Box(-1.0, 1.0, (5,)),
        },
    )
    .framework("torch")
    .resources(num_gpus=1) 
    .learners(num_learners=0, num_gpus_per_learner=1)
    .env_runners(
        num_env_runners=4, 
        num_cpus_per_env_runner=1,
        batch_mode="complete_episodes",
    )
    .rl_module(
        model_config_dict={
            "post_fcnet_hiddens": [64, 64],
            "post_fcnet_activation": "relu",
        },
        rl_module_spec=RLModuleSpec(
            module_class=ActionMaskingTorchRLModule,
        ),
    )
    .evaluation(
        evaluation_num_env_runners=1,
        evaluation_interval=1,  
        evaluation_parallel_to_training=True,
    ) 
)

checkpoint_at_every_iter = 5
iteration_count = 20
trainer = config.build()
results = []
for i in tqdm(range(1,iteration_count+1)):
    res = trainer.train()
    if i % checkpoint_at_every_iter == 0:
        path_to_checkpoint = trainer.save()        
        print(f"Checkpoint saved at {path_to_checkpoint}")
    results.append(res)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

PhilippWillms commented 2 months ago

@simonsays1980, @sven1977: This issue only occurs when using config.build() and then calling train() directly; running the same config through Tune does not hit it.
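
For reference, a rough sketch of the Tune path that works here, reusing the config object from the repro script above (the exact RunConfig import location may vary by Ray version):

from ray import train, tune

# Same stopping point as the 20-iteration loop in the repro script.
tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=train.RunConfig(stop={"training_iteration": 20}),
)
results = tuner.fit()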

simonsays1980 commented 1 month ago

@PhilippWillms Thanks for raising this issue. It's hard to reproduce as we do not have this hardware setup.

PhilippWillms commented 1 month ago

@simonsays1980: What I learnt today based on your comment: the correct torch version is installed, torch.cuda.is_available() returns True, and torch.cuda.device_count() returns 1 .... but only as long as the environment variable CUDA_VISIBLE_DEVICES is NOT set in the anaconda environment. Leave it unset and torch detects the GPU correctly.
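
A quick way to verify this before calling config.build(), assuming the single GPU really has device id 0:

import os
import torch

# Either leave CUDA_VISIBLE_DEVICES unset, or point it at the existing
# device id "0"; setting it to "1" hides the only GPU on this machine.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # expect None or "0"
print(torch.cuda.is_available())               # expect True
print(torch.cuda.device_count())               # expect 1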