pytorch / rl

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
https://pytorch.org/rl
MIT License
2.35k stars · 310 forks

[Feature Request] multiple GPUs on a single machine #2160

Open sgfCrazy opened 6 months ago

sgfCrazy commented 6 months ago

Motivation

I want to train using multiple GPUs on a single machine, but I can't find any relevant tutorial documentation.

Could you provide an example of training using multiple GPUs on a single machine? For instance, updating the network on cuda:0 while gathering data on cuda:1?

Thanks!
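As a starting point, the split described above can be sketched in plain PyTorch, independent of torchrl's collectors: keep the trained policy on one device, run a frozen inference copy on the other, and sync weights after each update. This is a minimal illustration only (the module, shapes, and loss are hypothetical placeholders), and it falls back to CPU when two GPUs are not available.

```python
import torch
import torch.nn as nn

# Hypothetical two-device loop: update on cuda:0, collect on cuda:1.
# Falls back to CPU when fewer than two GPUs are present.
two_gpus = torch.cuda.device_count() >= 2
update_device = torch.device("cuda:0" if two_gpus else "cpu")
collect_device = torch.device("cuda:1" if two_gpus else "cpu")

policy = nn.Linear(4, 2).to(update_device)   # trained copy
actor = nn.Linear(4, 2).to(collect_device)   # inference copy used for data collection
optim = torch.optim.SGD(policy.parameters(), lr=1e-2)

for step in range(3):
    # "Collect": run the inference copy on the collection device, no gradients.
    with torch.no_grad():
        obs = torch.randn(8, 4, device=collect_device)
        actions = actor(obs)

    # Move the batch to the update device and take a gradient step
    # (placeholder regression loss, just to show the data flow).
    obs_u, actions_u = obs.to(update_device), actions.to(update_device)
    loss = (policy(obs_u) - actions_u).pow(2).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()

    # Sync the updated weights back to the collection device.
    actor.load_state_dict(
        {k: v.to(collect_device) for k, v in policy.state_dict().items()}
    )
```

torchrl's collectors expose `device`/`storing_device` arguments that serve a similar purpose, as in the `generic.py` example referenced below.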


sgfCrazy commented 6 months ago

When I run this script (https://github.com/pytorch/rl/blob/v0.3.1/examples/distributed/collectors/single_machine/generic.py), it reports the following error (repeated gymnasium deprecation warnings trimmed):

```
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.num_envs for environment variables or env.get_wrapper_attr('num_envs') that will search the reminding wrappers.
  logger.warn(
/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.reward_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do env.unwrapped.reward_space for environment variables or env.get_wrapper_attr('reward_space') that will search the reminding wrappers.
  logger.warn(
[{'device': 'cuda:1', 'storing_device': 'cuda:1'}] cuda:0
  0%|          | 0/3000000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/cephfs/PERSONAL/usr/chenjiaxin/sgf/code/gfkd/DHPT/tests/test_multi_gpu_one_machine.py", line 162, in <module>
    for i, data in enumerate(collector):
  File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torchrl/collectors/distributed/generic.py", line 783, in iterator
    yield from self._iterator_dist()
  File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torchrl/collectors/distributed/generic.py", line 799, in _iterator_dist
    self._tensordict_out[i].irecv(src=rank, return_premature=True)
  File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/tensordict/base.py", line 3613, in irecv
    return self._irecv(
  File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/tensordict/base.py", line 3654, in _irecv
    _future_list.append(dist.irecv(value, src=src, tag=_tag, group=group))
  File "/root/miniconda3/envs/gfkd/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1628, in irecv
    return pg.recv([tensor], src, tag)
RuntimeError: No backend type associated with device type cpu
```
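For context, `RuntimeError: No backend type associated with device type cpu` from `torch.distributed` typically means the process group was initialized with a backend (e.g. NCCL) that cannot serve tensors on the device being communicated (here, CPU tensors reaching `irecv`), whereas gloo handles CPU tensors. Below is a minimal, self-contained single-process demo of a gloo group operating on a CPU tensor; the addresses and port are assumptions for the demo, not part of the torchrl example:

```python
import os
import torch
import torch.distributed as dist

# Minimal single-process demo (world_size=1): gloo serves CPU tensors,
# so collectives/point-to-point ops on them succeed. Under an NCCL-only
# group, the same call on a CPU tensor fails with
# "No backend type associated with device type cpu".
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(3)   # CPU tensor
dist.all_reduce(t)  # sum over the single rank: values unchanged
dist.destroy_process_group()
```

Checking which backend the script's process group is created with (and whether it covers CPU tensors) would be a reasonable first debugging step; the torchrl distributed collector documentation describes how its backend is selected.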

sgfCrazy commented 6 months ago

system info:

```python
import torchrl, tensordict, torch, numpy, sys
print(torch.__version__, tensordict.__version__, torchrl.__version__, numpy.__version__, sys.version, sys.platform)
```

```
2.2.1+cu121 0.3.1 0.3.1 1.26.4 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0] linux
```