Trying to run on cuda:1 crashes

Robokan commented 1 year ago

I have two GPUs and want to train only on the second one, so I ran:

python train.py task=Cartpole rl_device='cuda:1' sim_device='cuda:1'

It crashes with an error saying that something is still running on cuda:0. Any ideas how to fix this?

Here is the full stack trace:

(rlenv) bizon@dl:~/eric/IsaacGymEnvs-main/isaacgymenvs$ python train.py task=Cartpole rl_device='cuda:1' sim_device='cuda:1' Importing module 'gym_38' (/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/isaacgym/_bindings/linux-x86_64/gym_38.so) Setting GYM_USD_PLUG_INFO_PATH to /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json train.py:49: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1 @hydra.main(config_name="config", config_path="./cfg") /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config': Defaults list is missing _self_. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information warnings.warn(msg, UserWarning) /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/hydra/_internal/defaults_list.py:415: UserWarning: In config: Invalid overriding of hydra/job_logging: Default list overrides requires 'override' keyword. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/defaults_list_override for more information.

deprecation_warning(msg) /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information. ret = run_job( PyTorch version 1.13.1 Device count 2 /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/isaacgym/_bindings/src/gymtorch Using /home/bizon/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Emitting ninja build file /home/bizon/.cache/torch_extensions/py38_cu117/gymtorch/build.ninja... Building extension module gymtorch... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module gymtorch... /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/isaacgym/torch_utils.py:135: DeprecationWarning: np.float is a deprecated alias for the builtin float. To silence this warning, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations def get_axis_params(value, axis_idx, x_value=0., dtype=np.float, n_dims=3): 2023-04-14 09:06:54,989 - INFO - logger - logger initialized

:3: DeprecationWarning: invalid escape sequence \* Error: FBX library failed to load - importing FBX data will not succeed. Message: No module named 'fbx' FBX tools must be installed from https://help.autodesk.com/view/FBX/2020/ENU/?guid=FBX_Developer_Help_scripting_with_python_fbx_installing_python_fbx_html /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. if not hasattr(tensorboard, "__version__") or LooseVersion( /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:568: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations (np.object, string), /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:569: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations (np.bool, bool), /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/tensorboard/util/tensor_util.py:100: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. 
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations np.object: SlowAppendObjectArrayToTensorProto, /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/tensorboard/util/tensor_util.py:101: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations np.bool: SlowAppendBoolArrayToTensorProto, task: name: Cartpole physics_engine: physx env: numEnvs: 512 envSpacing: 4.0 resetDist: 3.0 maxEffort: 400.0 clipObservations: 5.0 clipActions: 1.0 asset: assetRoot: ../../assets assetFileName: urdf/cartpole.urdf enableCameraSensors: False sim: dt: 0.0166 substeps: 2 up_axis: z use_gpu_pipeline: True gravity: [0.0, 0.0, -9.81] physx: num_threads: 4 solver_type: 1 use_gpu: True num_position_iterations: 4 num_velocity_iterations: 0 contact_offset: 0.02 rest_offset: 0.001 bounce_threshold_velocity: 0.2 max_depenetration_velocity: 100.0 default_buffer_size_multiplier: 2.0 max_gpu_contact_pairs: 1048576 num_subscenes: 4 contact_collection: 0 task: randomize: False train: params: seed: 42 algo: name: a2c_continuous model: name: continuous_a2c_logstd network: name: actor_critic separate: False space: continuous: mu_activation: None sigma_activation: None mu_init: name: default sigma_init: name: const_initializer val: 0 fixed_sigma: True mlp: units: [32, 32] activation: elu initializer: name: default regularizer: name: None load_checkpoint: False load_path: config: name: Cartpole full_experiment_name: Cartpole env_name: rlgpu ppo: True mixed_precision: False normalize_input: True normalize_value: True num_actors: 512 reward_shaper: scale_value: 0.1 normalize_advantage: True gamma: 0.99 tau: 0.95 
learning_rate: 0.0003 lr_schedule: adaptive kl_threshold: 0.008 score_to_win: 20000 max_epochs: 100 save_best_after: 50 save_frequency: 25 grad_norm: 1.0 entropy_coef: 0.0 truncate_grads: True e_clip: 0.2 horizon_length: 16 minibatch_size: 8192 mini_epochs: 8 critic_coef: 4 clip_value: True seq_len: 4 bounds_loss_coef: 0.0001 task_name: Cartpole experiment: num_envs: seed: 42 torch_deterministic: False max_iterations: physics_engine: physx pipeline: gpu sim_device: cuda:1 rl_device: cuda:1 graphics_device_id: 0 num_threads: 4 solver_type: 1 num_subscenes: 4 test: False checkpoint: multi_gpu: False wandb_activate: False wandb_group: wandb_name: Cartpole wandb_entity: wandb_project: isaacgymenvs capture_video: False capture_video_freq: 1464 capture_video_len: 100 force_render: True headless: False
Setting seed: 42
self.seed = 42
Started to train
Exact experiment name requested from command line: Cartpole
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
[Warning] [carb.gym.plugin] useGpu is set, forcing single scene (0 subscenes)
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:1
GPU Pipeline: enabled
Box(-1.0, 1.0, (1,), float32) Box(-inf, inf, (4,), float32)
current training device: cuda:0
build mlp: 4
RunningMeanStd: (1,)
RunningMeanStd: (4,)
Error executing job with overrides: ['task=Cartpole', 'rl_device=cuda:1', 'sim_device=cuda:1']
Traceback (most recent call last):
  File "train.py", line 161, in launch_rlg_hydra
    runner.run({
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/torch_runner.py", line 120, in run
    self.run_train(args)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/torch_runner.py", line 101, in run_train
    agent.train()
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1173, in train
    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1037, in train_epoch
    batch_dict = self.play_steps()
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 626, in play_steps
    res_dict = self.get_action_values(self.obs)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 348, in get_action_values
    res_dict = self.model(input_dict)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/algos_torch/models.py", line 246, in forward
    input_dict['obs'] = self.norm_obs(input_dict['obs'])
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/algos_torch/models.py", line 49, in norm_obs
    return self.running_mean_std(observation) if self.normalize_input else observation
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/algos_torch/running_mean_std.py", line 79, in forward
    y = (input - current_mean.float()) / torch.sqrt(current_var.float() + self.epsilon)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

utomm commented 1 year ago

Hi, I encountered the same issue, and according to #109, it is because rl_device='cuda:1' doesn't work correctly.

You can either follow their solution or simply add CUDA_VISIBLE_DEVICES=[gpu_ids] in front of your training command.
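The `[gpu_ids]` in that suggestion reads as a placeholder rather than literal syntax: as far as I know, CUDA expects a plain comma-separated list of device indices with no brackets. Below is a minimal sketch of building that value, and an environment for a child process, in Python. The helper names `visible_devices_value` and `env_with_gpus` are hypothetical and not part of IsaacGymEnvs.

```python
import os
from typing import Dict, Sequence


def visible_devices_value(gpu_ids: Sequence[int]) -> str:
    """Format GPU indices as a CUDA_VISIBLE_DEVICES value, e.g. [0, 1] -> "0,1"."""
    if not gpu_ids:
        raise ValueError("need at least one GPU index")
    # Plain comma-separated integers: no brackets, no spaces.
    return ",".join(str(int(i)) for i in gpu_ids)


def env_with_gpus(gpu_ids: Sequence[int]) -> Dict[str, str]:
    """Copy the current environment and restrict the GPUs a child process can see."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices_value(gpu_ids)
    return env


if __name__ == "__main__":
    # Value to use when only the second physical GPU should be visible.
    print(visible_devices_value([1]))
```

The returned dictionary could then be passed as the `env` argument of `subprocess.run` when launching the training script, which keeps the restriction local to that one process instead of exporting it for the whole shell session.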

Robokan commented 1 year ago

I just tried

CUDA_VISIBLE_DEVICES=1, python train.py task=Cartpole

CUDA_VISIBLE_DEVICES=1, python train.py task=Cartpole rl_device='cuda:1' sim_device='cuda:1'

CUDA_VISIBLE_DEVICES=[1], python train.py task=Cartpole rl_device='cuda:1' sim_device='cuda:1'

CUDA_VISIBLE_DEVICES=[1], python train.py task=Cartpole

None of these work. It still crashes. I tried just using export as well. Were you able to get it to work?

MatPoliquin commented 1 year ago

@Robokan If you use CUDA_VISIBLE_DEVICES=1, you need to use cuda:0 instead of cuda:1, since only one GPU is now exposed to the process and it is renumbered as device 0.
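The renumbering can be sketched in plain Python. This only illustrates the common case where CUDA_VISIBLE_DEVICES holds integer indices; the function name `logical_device` is hypothetical, and UUID-style entries are not handled.

```python
def logical_device(visible_devices: str, physical_index: int) -> str:
    """Return the device string a process sees for a given physical GPU.

    Visible GPUs are renumbered from 0 in the order they are listed, so the
    logical index is the position of the physical index in the list.
    """
    # Skip empty tokens so a trailing comma (e.g. "1,") is tolerated here.
    ids = [int(tok) for tok in (t.strip() for t in visible_devices.split(",")) if tok]
    if physical_index not in ids:
        raise ValueError(f"GPU {physical_index} is not visible with {visible_devices!r}")
    return f"cuda:{ids.index(physical_index)}"


if __name__ == "__main__":
    # With only physical GPU 1 exposed, it is addressed as cuda:0.
    print(logical_device("1", 1))
```

For the setup in this thread, that means running with CUDA_VISIBLE_DEVICES=1 and passing rl_device='cuda:0' sim_device='cuda:0' to train.py.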

Robokan commented 1 year ago

Great, that works. Thanks for the clarification.

isaac-sim / IsaacGymEnvs

Trying to run on cuda:1 crashes #129