Closed. Privilger closed this issue 3 months ago.
You have to use SLURM and design your code to leverage multiple GPUs; not all RL frameworks support it.
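As a generic illustration of the SLURM side (job name, partition, and the `train.py` entry point are all placeholders, not anything from the Orbit repo), a batch script requesting multiple GPUs on one node looks roughly like:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=rl-train
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2      # request 2 GPUs on a single node
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

# Launch one worker process per GPU; your RL framework still has to
# support distributed training for this to help.
srun python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 train.py
```

Submit it with `sbatch script.sh`; the point is that SLURM only allocates the resources, while the per-GPU parallelism comes from the launcher and the framework.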
I think this is their example of how to use Singularity containers. The Docker container format is OCI; Singularity uses SIF. They are different container standards: https://docs.sylabs.io/guides/3.5/user-guide/introduction.html
It is a little different and has its challenges; I hope this information helps you.
docker/cluster https://github.com/NVIDIA-Omniverse/orbit/tree/main/docker/cluster
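For illustration of the OCI-to-SIF point above (the image name here is a placeholder, not the one Orbit's cluster scripts actually use), converting and running a Docker image with Singularity generally looks like:

```shell
# Pull an OCI/Docker image and convert it into a Singularity SIF image.
singularity build isaac.sif docker://nvcr.io/nvidia/isaac-sim:latest

# Run a command inside the container; --nv passes the host's NVIDIA GPUs through.
singularity exec --nv isaac.sif python train.py --headless
```

This is why the cluster workflow differs from plain Docker: the image is rebuilt as SIF first, then executed under the cluster scheduler.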
I think the methods you mentioned are about using a cluster and containers.
What about multiple GPUs in a single workstation? torchrun works well with the rl_games framework, but it does not run well in the Orbit framework. So I think this is more about configuring Orbit to work with multiple GPUs.
This is about computer architecture, OS architecture, and parallel computing architecture. Multi-GPU (and even multi-core CPU) work is parallel computing, regardless of how you orchestrate it. Because of the particular way NVIDIA Omniverse Isaac Sim is designed and how Orbit runs on top of it, you should use SLURM to achieve maximum performance.
For multiple GPUs across multiple machines, using containers and SLURM might be a solution.
However, for multiple GPUs in one machine, I believe SLURM is not the way to do it.
It has nothing to do with the "multi-GPU and even multi-core CPU" point you mentioned.
torchrun should solve the problem in this case; I just have not figured out how yet.
This repo gives an example of using torchrun with Isaac Sim: https://github.com/NVIDIA-Omniverse/OmniIsaacGymEnvs
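If I remember right, that repo documents a launch roughly along these lines; the exact script path, task name, and flags may differ between versions, so treat this as a sketch rather than the definitive invocation:

```shell
# Single node, one worker process per GPU (2 GPUs here).
# multi_gpu=True tells the OmniIsaacGymEnvs training script to initialize
# distributed training from the environment variables torchrun sets.
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
    scripts/rlgames_train.py task=Ant headless=True multi_gpu=True
```

Note the framework-side flag: torchrun alone only spawns the processes, and the training script must read the rank/world-size environment variables to actually shard the work.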
python source/standalone/workflows/rl_games/train.py --task Isaac-Reach-UR10-v0 --headless
Try the task Isaac-Ant-v0.
My task is the robot one; Ant works OK.
This should work properly now; see: https://isaac-sim.github.io/IsaacLab/source/features/multi_gpu.html
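At the time of writing, the linked page documents a launch roughly of this shape; the task name below is a placeholder and the flag spelling may change between Isaac Lab releases, so check the page for the current form:

```shell
# Isaac Lab multi-GPU training: one process per GPU on a single node.
# The --distributed flag makes the workflow script pick up the rank and
# world size that torch.distributed.run provides.
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
    source/standalone/workflows/rl_games/train.py \
    --task Isaac-Ant-v0 --headless --distributed
```

This addresses the original problem: bare `torchrun train.py` fails because the script also needs to be told, via its own option, to run in distributed mode.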
Question
Hi,
The tutorial gives a script showing how to train an RL policy using the rl_games framework: source/standalone/workflows/rl_games/train.py
But the example seems to work on only one GPU.
I saw that rl_games (https://github.com/Denys88/rl_games/tree/master) can use torchrun to leverage multiple GPUs, but how can I use torchrun here, in the Orbit framework? I tried to run the command:
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py
but it does not seem to work.