lostmsu opened this pull request 3 years ago (status: Open)
@lostmsu Thank you for your PR. To reduce trait bounds for observations in agents like DQN, I added a `TchBufferOnDevice` trait, which is responsible for transferring data to the device of the model.
However, I don't see any speedup from a replay buffer on GPU in an example (`sac_ant_gpu`). It might be due to the small memory footprint of the vector observations of the Ant environment. Although the effect of a GPU replay buffer could be significant when the memory footprint of observations is large, as in Atari environments with stacked images, limited GPU memory does not allow us to use a large replay buffer capacity (Atari environments require about 20〜30 GB of memory for a capacity of 1,000,000 samples). If you have cases where a replay buffer on GPU works well, it would be nice if you could share that information.
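As a rough sanity check on the 20〜30 GB figure, here is a back-of-the-envelope estimate (a sketch; the exact numbers depend on how the buffer stores frames) for 1,000,000 transitions with the usual 4 stacked 84×84 grayscale frames at one byte per pixel:

```rust
fn main() {
    // One Atari observation: 4 stacked 84x84 grayscale frames, 1 byte per pixel.
    let obs_bytes: u64 = 4 * 84 * 84; // 28,224 bytes
    let capacity: u64 = 1_000_000;

    // Storing obs and next_obs separately for every transition.
    let naive = 2 * obs_bytes * capacity;
    // With overlapping stacked frames deduplicated, roughly one obs per transition.
    let shared = obs_bytes * capacity;

    let gib = |b: u64| b as f64 / (1024.0 * 1024.0 * 1024.0);
    println!("naive: {:.1} GiB, frame-shared: {:.1} GiB", gib(naive), gib(shared));
}
```

The frame-shared layout lands in the quoted 20〜30 GB range; storing `obs` and `next_obs` separately roughly doubles it.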
@taku-y I tried your `sac_ant_gpu`, and I get about a 3% FPS improvement from using the GPU replay buffer (I also had to increase the batch size to 4096 to put a noticeable load on the GPU).
With GPU buffer:
```
[2021-08-12T04:50:39Z INFO border_core::core::trainer] Opt step 20000, Eval (mean, min, max) of r_sum: 422.5705, 347.70502, 501.35193
[2021-08-12T04:50:39Z INFO border_core::core::trainer] 129.4951 FPS in training
[2021-08-12T04:50:39Z INFO border_core::core::trainer] 4.607 sec. in evaluation
[2021-08-12T04:51:58Z INFO border_core::core::util] Episode 0, 999 steps, reward = 557.7083
[2021-08-12T04:51:59Z INFO border_core::core::util] Episode 1, 999 steps, reward = 492.571
[2021-08-12T04:52:00Z INFO border_core::core::util] Episode 2, 999 steps, reward = 662.32434
[2021-08-12T04:52:01Z INFO border_core::core::util] Episode 3, 999 steps, reward = 559.0846
[2021-08-12T04:52:02Z INFO border_core::core::util] Episode 4, 999 steps, reward = 686.3971
[2021-08-12T04:52:02Z INFO border_core::core::trainer] Opt step 30000, Eval (mean, min, max) of r_sum: 591.61707, 492.571, 686.3971
[2021-08-12T04:52:02Z INFO border_core::core::trainer] 127.43561 FPS in training
[2021-08-12T04:52:02Z INFO border_core::core::trainer] 4.637 sec. in evaluation
```
Without it (i.e., with the last `device` in the `build2` call replaced by `tch::Device::Cpu`):
```
[2021-08-12T04:55:26Z INFO border_core::core::trainer] Opt step 20000, Eval (mean, min, max) of r_sum: 549.3173, 429.12888, 761.3656
[2021-08-12T04:55:26Z INFO border_core::core::trainer] 124.081795 FPS in training
[2021-08-12T04:55:26Z INFO border_core::core::trainer] 4.811 sec. in evaluation
[2021-08-12T04:56:47Z INFO border_core::core::util] Episode 0, 999 steps, reward = 374.09595
[2021-08-12T04:56:48Z INFO border_core::core::util] Episode 1, 999 steps, reward = 540.211
[2021-08-12T04:56:49Z INFO border_core::core::util] Episode 2, 999 steps, reward = 392.3262
[2021-08-12T04:56:50Z INFO border_core::core::util] Episode 3, 999 steps, reward = 353.30508
[2021-08-12T04:56:51Z INFO border_core::core::util] Episode 4, 999 steps, reward = 591.79846
[2021-08-12T04:56:51Z INFO border_core::core::trainer] Opt step 30000, Eval (mean, min, max) of r_sum: 450.34735, 353.30508, 591.79846
[2021-08-12T04:56:51Z INFO border_core::core::trainer] 125.22227 FPS in training
[2021-08-12T04:56:51Z INFO border_core::core::trainer] 4.841 sec. in evaluation
```
The difference in evaluation time seems suspicious, though, as evaluation should not be affected by the buffer device.
If you want, I can run more iterations to get a better idea of whether those 3% were a random occurrence.
@lostmsu Thank you for your report. I think there is no statistically significant speed difference for the Ant environment. I will do an experiment on Atari Pong with a small buffer size. And regardless of the result, if the updated code looks good to you, I will merge this PR.
@taku-y I find it weird that the buffer needs to know where the model is. This would not work for multiple GPUs: the training loop needs to move data from where it is to where it has to be, not the buffer.
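The division of responsibility being argued for here can be sketched in a few lines. This is a mock (the `Tensor` and `Device` types below are stand-ins for `tch::Tensor` and `tch::Device`, not border's real API): the buffer stays device-agnostic, and the training loop, which knows where each model lives, performs the move.

```rust
// Mock stand-ins for tch::Device and tch::Tensor, tracking only placement.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Device { Cpu, Cuda(usize) }

#[derive(Clone)]
struct Tensor { data: Vec<f32>, device: Device }

impl Tensor {
    fn to_device(&self, device: Device) -> Tensor {
        Tensor { data: self.data.clone(), device }
    }
}

// The buffer knows nothing about models or devices; it just stores samples.
struct ReplayBuffer { samples: Vec<Tensor> }

impl ReplayBuffer {
    fn batch(&self) -> Tensor { self.samples[0].clone() }
}

fn main() {
    let buffer = ReplayBuffer {
        samples: vec![Tensor { data: vec![0.0; 8], device: Device::Cpu }],
    };
    // The training loop, not the buffer, moves the batch to the model's device.
    let model_device = Device::Cuda(0);
    let batch = buffer.batch().to_device(model_device);
    assert_eq!(batch.device, Device::Cuda(0));
}
```

With this split, two models on two different GPUs simply call `to_device` with different targets on the same buffer output.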
@taku-y Also, is there a reason to keep `reward` and `not_done` on the CPU?
@lostmsu
> I find it weird that the buffer needs to know where the model is. This would not work for multiple GPUs: the training loop needs to move data from where it is to where it has to be, not the buffer.
That sounds reasonable. Supporting multiple GPUs would be a nice feature, but I don't have any idea how to implement it. I need to have a look at some papers and slides (e.g., paper and slides) and at other RL libraries like Ray and PFRL.
On the other hand, I would like agents to be generic, supporting mini-batches that could be, for example, a set of `Vec`s or some `enum`s. But such agents might be better implemented as separate structs; for the DQN agent, there could be two versions, one supporting a GPU buffer and one not.
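The `enum`-based mini-batch idea mentioned above could look roughly like this (a hypothetical sketch, not border's actual types; `Tensor` is a mock stand-in for `tch::Tensor`):

```rust
use std::collections::HashMap;

// Mock tensor: a flat buffer standing in for tch::Tensor.
#[derive(Clone, Debug, PartialEq)]
struct Tensor(Vec<f32>);

// A mini-batch that is either a single tensor or a named collection,
// so one agent type can accept both plain and structured observations.
enum Batch {
    Single(Tensor),
    Dict(HashMap<String, Tensor>),
}

impl Batch {
    // Agents that only handle plain tensors can reject structured batches.
    fn as_single(&self) -> Option<&Tensor> {
        match self {
            Batch::Single(t) => Some(t),
            Batch::Dict(_) => None,
        }
    }
}

fn main() {
    let plain = Batch::Single(Tensor(vec![1.0, 2.0]));
    assert!(plain.as_single().is_some());

    let mut obs = HashMap::new();
    obs.insert("camera".to_string(), Tensor(vec![0.0; 4]));
    obs.insert("joints".to_string(), Tensor(vec![0.5; 2]));
    let structured = Batch::Dict(obs);
    assert!(structured.as_single().is_none());
}
```

A trade-off of the `enum` route is that mismatches (an agent given a batch variant it does not support) surface at runtime rather than at compile time, which is exactly what separate structs per agent version would avoid.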
> is there a reason to keep `reward` and `not_done` on CPU?
I just wasn't aware of it. These buffers should be on the specified device.
@taku-y I spent some time trying to polish this up, and I started to believe that making `TchBuffer` and `TchBatch` generic is a mistake. All components should just be `Tensor` objects instead of `O::SubBatch`, `O::Item`, or `A::*`. The training loop should simply convert them all to tensors the instant it receives them from the environment. This would also make most models be of type `SubModel<Input = Tensor, Output = Tensor>`, potentially removing the need for `SubModel` to be generic too.
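To illustrate the point about `SubModel<Input = Tensor, Output = Tensor>`: here is a hypothetical simplification of such a trait (not border's real signature; `Tensor` and `Mlp` are mocks for illustration). Once inputs are tensors at the boundary, a concrete model no longer needs generic observation or action types.

```rust
// Mock tensor type standing in for tch::Tensor.
#[derive(Clone, Debug, PartialEq)]
struct Tensor(Vec<f32>);

// A generic sub-model, in the spirit of (but not identical to) border's
// SubModel trait.
trait SubModel {
    type Input;
    type Output;
    fn forward(&self, input: &Self::Input) -> Self::Output;
}

// If the training loop converts everything to tensors up front, most models
// collapse to Input = Tensor, Output = Tensor.
struct Mlp { weight: f32 }

impl SubModel for Mlp {
    type Input = Tensor;
    type Output = Tensor;
    fn forward(&self, input: &Tensor) -> Tensor {
        // Toy "layer": scale every element by a single weight.
        Tensor(input.0.iter().map(|x| x * self.weight).collect())
    }
}

fn main() {
    let model = Mlp { weight: 2.0 };
    let out = model.forward(&Tensor(vec![1.0, 3.0]));
    assert_eq!(out, Tensor(vec![2.0, 6.0]));
}
```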
A good argument for this is that `TchBuffer` already leaks the fact that it heavily relies on `tch`, because `fn batch(...)` takes an argument of type `&Tensor`. It is even named `TchBuffer`!
Do you have a scenario where this behavior (i.e., converting observations and actions to `Tensor` immediately) would not work? If not, I'd update this PR to make the modification I am suggesting.
@lostmsu Environments with `Tuple` observations in gym wouldn't work with `Tensor` observations (link). Another example is a picking robot whose observation includes a camera image and the joint angles of the manipulator. I want to apply the agents to such environments.
What about `Vec<Tensor>`?
@lostmsu A `Dict`-like structure would be better for avoiding errors. Another example of such an environment is RLBench, which provides a set of benchmarks with robots. Each observation has four camera images and other information about the state, like joint positions and angles.
I think I should add an example to demonstrate the flexibility of the library.
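A structured observation like the RLBench one described above could be modeled as follows (an illustrative sketch only: the field names are invented, not RLBench's or border's actual API, and `Tensor` is a mock). The point is that a `Dict`-like flattening gives each component a name, which is harder to get wrong than positional `Vec<Tensor>` indices.

```rust
use std::collections::HashMap;

// Mock tensor standing in for tch::Tensor; only the shape matters here.
#[derive(Clone, Debug)]
struct Tensor { shape: Vec<i64> }

// An RLBench-style structured observation: several camera images plus
// low-dimensional robot state (names are illustrative).
struct RobotObs {
    cameras: HashMap<String, Tensor>,
    joint_positions: Tensor,
    joint_velocities: Tensor,
}

impl RobotObs {
    // Flatten into a single name -> Tensor map, the shape a Dict-like
    // batch type could consume.
    fn into_dict(self) -> HashMap<String, Tensor> {
        let mut dict = self.cameras;
        dict.insert("joint_positions".into(), self.joint_positions);
        dict.insert("joint_velocities".into(), self.joint_velocities);
        dict
    }
}

fn main() {
    let mut cameras = HashMap::new();
    for name in ["front", "wrist", "left_shoulder", "right_shoulder"] {
        cameras.insert(name.to_string(), Tensor { shape: vec![3, 128, 128] });
    }
    let obs = RobotObs {
        cameras,
        joint_positions: Tensor { shape: vec![7] },
        joint_velocities: Tensor { shape: vec![7] },
    };
    let dict = obs.into_dict();
    assert_eq!(dict.len(), 6); // 4 cameras + 2 state tensors
}
```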
This avoids the need to copy replay buffer samples from CPU to GPU on every training step. However, if GPU RAM is insufficient, the replay buffer can be kept on the CPU.
This requires `Obs` to be a `Tensor` (`Act` already is).