Closed. Privilger closed this issue 3 months ago.
You have to use SLURM and design your code to leverage multiple GPUs; not all RL frameworks support it.
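As a generic illustration of the SLURM side (job name, partition, and the `train.py` entry point are all placeholders, not anything from the Orbit repo), a batch script requesting multiple GPUs on one node looks roughly like:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=rl-train
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2      # request 2 GPUs on a single node
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

# Launch one worker process per GPU; your RL framework still has to
# support distributed training for this to help.
srun python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 train.py
```

Submit it with `sbatch script.sh`; the point is that SLURM only allocates the resources, while the per-GPU parallelism comes from the launcher and the framework.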
I think this is their example of how to use Singularity containers. The Docker container format is OCI; Singularity uses SIF. They are different container standards: https://docs.sylabs.io/guides/3.5/user-guide/introduction.html
It is a little different and has its challenges; I hope this information helps you.
docker/cluster https://github.com/NVIDIA-Omniverse/orbit/tree/main/docker/cluster
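For illustration of the OCI-to-SIF point above (the image name here is a placeholder, not the one Orbit's cluster scripts actually use), converting and running a Docker image with Singularity generally looks like:

```shell
# Pull an OCI/Docker image and convert it into a Singularity SIF image.
singularity build isaac.sif docker://nvcr.io/nvidia/isaac-sim:latest

# Run a command inside the container; --nv passes the host's NVIDIA GPUs through.
singularity exec --nv isaac.sif python train.py --headless
```

This is why the cluster workflow differs from plain Docker: the image is rebuilt as SIF first, then executed under the cluster scheduler.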
I think the methods you mentioned are about using a cluster and containers.
What about multiple GPUs in a single workstation? torchrun works well with the rl_games framework, but it does not run well in the Orbit framework. So I think this is more about configuring Orbit to work with multiple GPUs.
This is about computer architecture, OS architecture, and parallel computing architecture. Multi-GPU (and even multi-core CPU) work is parallel computing, regardless of how you orchestrate it. Because of the particular way NVIDIA Omniverse Isaac Sim is designed and how Orbit runs on top of it, you should use SLURM to achieve maximum performance.
For multiple GPUs across multiple machines, using containers and SLURM might be a solution.
However, for multiple GPUs in one machine, I believe SLURM is not the way to do it.
It has nothing to do with the "multi-GPU and even multi-core CPU" point you mentioned.
torchrun should solve the problem in this case; I just have not figured out how yet.
This repo gives an example of using torchrun with Isaac Sim: https://github.com/NVIDIA-Omniverse/OmniIsaacGymEnvs
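If I remember right, that repo documents a launch roughly along these lines; the exact script path, task name, and flags may differ between versions, so treat this as a sketch rather than the definitive invocation:

```shell
# Single node, one worker process per GPU (2 GPUs here).
# multi_gpu=True tells the OmniIsaacGymEnvs training script to initialize
# distributed training from the environment variables torchrun sets.
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
    scripts/rlgames_train.py task=Ant headless=True multi_gpu=True
```

Note the framework-side flag: torchrun alone only spawns the processes, and the training script must read the rank/world-size environment variables to actually shard the work.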
python source/standalone/workflows/rl_games/train.py --task Isaac-Reach-UR10-v0 --headless
Try the task Isaac-Ant-v0.
My task is the robot one; Ant works OK.
This should work properly now; see: https://isaac-sim.github.io/IsaacLab/source/features/multi_gpu.html
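At the time of writing, the linked page documents a launch roughly of this shape; the task name below is a placeholder and the flag spelling may change between Isaac Lab releases, so check the page for the current form:

```shell
# Isaac Lab multi-GPU training: one process per GPU on a single node.
# The --distributed flag makes the workflow script pick up the rank and
# world size that torch.distributed.run provides.
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
    source/standalone/workflows/rl_games/train.py \
    --task Isaac-Ant-v0 --headless --distributed
```

This addresses the original problem: bare `torchrun train.py` fails because the script also needs to be told, via its own option, to run in distributed mode.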
Question
Hi,
The tutorial gives a script showing how to train an RL policy using the rl_games framework: source/standalone/workflows/rl_games/train.py
But the example seems to work on only one GPU.
I saw that rl_games (https://github.com/Denys88/rl_games/tree/master) can use torchrun to leverage multiple GPUs, but how can I use torchrun here, in the Orbit framework? I tried to run the command:
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py
but it does not seem to work.