laboroai / border

A reinforcement learning library in Rust
Apache License 2.0

Allow choosing the torch device to store replay buffer #44

Open lostmsu opened 3 years ago

lostmsu commented 3 years ago

This avoids the need to copy replay buffer samples from the CPU to the GPU on every training step. However, if GPU RAM is insufficient, the replay buffer can still be kept on the CPU.

This requires Obs to be a Tensor (Act already is).
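
For readers of this thread, the idea is roughly the following. This is a minimal sketch in tch-rs with hypothetical struct and field names, not the actual border API: the buffer's backing tensors are allocated on a caller-chosen tch::Device, and incoming samples are moved to that device once, at insertion time.

```rust
use tch::{Device, Kind, Tensor};

/// Hypothetical replay buffer whose backing tensors live on a
/// caller-chosen device (illustrative only, not border's types).
struct ReplayBuffer {
    obs: Tensor,
    act: Tensor,
    device: Device,
    capacity: i64,
    len: i64,
    head: i64,
}

impl ReplayBuffer {
    fn new(capacity: i64, obs_dim: i64, act_dim: i64, device: Device) -> Self {
        Self {
            obs: Tensor::zeros(&[capacity, obs_dim], (Kind::Float, device)),
            act: Tensor::zeros(&[capacity, act_dim], (Kind::Float, device)),
            device,
            capacity,
            len: 0,
            head: 0,
        }
    }

    fn push(&mut self, obs: &Tensor, act: &Tensor) {
        // Move incoming data once, at insertion time, to the buffer's device.
        let mut obs_slot = self.obs.get(self.head);
        obs_slot.copy_(&obs.to_device(self.device));
        let mut act_slot = self.act.get(self.head);
        act_slot.copy_(&act.to_device(self.device));
        self.head = (self.head + 1) % self.capacity;
        self.len = (self.len + 1).min(self.capacity);
    }
}
```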

taku-y commented 3 years ago

@lostmsu Thank you for your PR. To reduce the trait bounds for observations in agents like DQN, I added the TchBufferOnDevice trait, which is responsible for transferring data to the device of the model.
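
The shape of such a trait is presumably something along these lines (illustrative names and signatures only, not the actual TchBufferOnDevice definition): the buffer knows which device the model lives on and hands out batches already placed there.

```rust
use tch::{Device, Tensor};

// Illustrative sketch of a "buffer on device" trait.
trait BufferOnDevice {
    /// Device on which the model's parameters live.
    fn model_device(&self) -> Device;

    /// Move a sampled batch to the model's device; effectively a no-op
    /// when the buffer already stores its data there.
    fn to_model_device(&self, batch: Tensor) -> Tensor {
        batch.to_device(self.model_device())
    }
}
```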

However, I don't see any speedup with a replay buffer on GPU in an example (sac_ant_gpu). It might be due to the small memory footprint of the vector observations of the Ant environment. Although the effect of a GPU replay buffer could be significant when the memory footprint of observations is large, as in Atari environments with stacked images, the limited GPU memory does not allow us to use a large replay buffer capacity (Atari envs require about 20〜30GB of memory for a capacity of 1,000,000 samples). If you have cases where a replay buffer on GPU works well, it would be nice if you could share that information.
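
As a rough sanity check on that figure, a back-of-the-envelope calculation (assuming the common Atari preprocessing of 4 stacked 84x84 uint8 frames per observation and a single stored copy per transition) lands in the same range:

```rust
// Back-of-the-envelope for the quoted 20〜30GB figure.
fn main() {
    let capacity: u64 = 1_000_000;
    let bytes_per_obs: u64 = 4 * 84 * 84; // 4 frames * 84 * 84 * 1 byte = 28,224
    let gb = (capacity * bytes_per_obs) as f64 / 1e9;
    println!("~{gb:.1} GB for observations alone"); // prints "~28.2 GB ..."
}
```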

lostmsu commented 3 years ago

@taku-y I tried your sac_ant_gpu, and I get about a 3% FPS improvement from using the GPU replay buffer (I also had to increase the batch size to 4096 to put a noticeable load on the GPU).

With GPU buffer:

[2021-08-12T04:50:39Z INFO border_core::core::trainer] Opt step 20000, Eval (mean, min, max) of r_sum: 422.5705, 347.70502, 501.35193
[2021-08-12T04:50:39Z INFO border_core::core::trainer] 129.4951 FPS in training
[2021-08-12T04:50:39Z INFO border_core::core::trainer] 4.607 sec. in evaluation
[2021-08-12T04:51:58Z INFO border_core::core::util] Episode 0, 999 steps, reward = 557.7083
[2021-08-12T04:51:59Z INFO border_core::core::util] Episode 1, 999 steps, reward = 492.571
[2021-08-12T04:52:00Z INFO border_core::core::util] Episode 2, 999 steps, reward = 662.32434
[2021-08-12T04:52:01Z INFO border_core::core::util] Episode 3, 999 steps, reward = 559.0846
[2021-08-12T04:52:02Z INFO border_core::core::util] Episode 4, 999 steps, reward = 686.3971
[2021-08-12T04:52:02Z INFO border_core::core::trainer] Opt step 30000, Eval (mean, min, max) of r_sum: 591.61707, 492.571, 686.3971
[2021-08-12T04:52:02Z INFO border_core::core::trainer] 127.43561 FPS in training
[2021-08-12T04:52:02Z INFO border_core::core::trainer] 4.637 sec. in evaluation

Without (e.g., replacing the last device in the build2 call with tch::Device::Cpu):

[2021-08-12T04:55:26Z INFO border_core::core::trainer] Opt step 20000, Eval (mean, min, max) of r_sum: 549.3173, 429.12888, 761.3656
[2021-08-12T04:55:26Z INFO border_core::core::trainer] 124.081795 FPS in training
[2021-08-12T04:55:26Z INFO border_core::core::trainer] 4.811 sec. in evaluation
[2021-08-12T04:56:47Z INFO border_core::core::util] Episode 0, 999 steps, reward = 374.09595
[2021-08-12T04:56:48Z INFO border_core::core::util] Episode 1, 999 steps, reward = 540.211
[2021-08-12T04:56:49Z INFO border_core::core::util] Episode 2, 999 steps, reward = 392.3262
[2021-08-12T04:56:50Z INFO border_core::core::util] Episode 3, 999 steps, reward = 353.30508
[2021-08-12T04:56:51Z INFO border_core::core::util] Episode 4, 999 steps, reward = 591.79846
[2021-08-12T04:56:51Z INFO border_core::core::trainer] Opt step 30000, Eval (mean, min, max) of r_sum: 450.34735, 353.30508, 591.79846
[2021-08-12T04:56:51Z INFO border_core::core::trainer] 125.22227 FPS in training
[2021-08-12T04:56:51Z INFO border_core::core::trainer] 4.841 sec. in evaluation

The difference in evaluation time seems suspicious though, as it should not be affected.

If you want, I can run more iterations to get a better idea of whether those 3% were a random occurrence.

taku-y commented 3 years ago

@lostmsu Thank you for your report. I think the speed difference on the Ant environment is not statistically significant. I will run an experiment on Atari Pong with a small buffer size. Regardless of the result, if the updated code looks good to you, I will merge this PR.

lostmsu commented 3 years ago

@taku-y I find it weird that the buffer needs to know where the model is. This would not work for multiple GPUs: the training loop needs to move data from where it is to where it has to be, not the buffer.
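
A minimal sketch of the design being argued for here (hypothetical types, not border's current API): the batch is plain tensors stored on whatever device the buffer happens to use, and the training loop moves it to the device of the model being updated.

```rust
use tch::{Device, Tensor};

/// Hypothetical batch type: plain tensors, device-agnostic.
struct Batch {
    obs: Tensor,
    act: Tensor,
    reward: Tensor,
    not_done: Tensor,
}

impl Batch {
    // The training loop, not the buffer, decides where the data goes;
    // with multiple GPUs, each model's update gets inputs placed on
    // that model's own device.
    fn to_device(&self, device: Device) -> Batch {
        Batch {
            obs: self.obs.to_device(device),
            act: self.act.to_device(device),
            reward: self.reward.to_device(device),
            not_done: self.not_done.to_device(device),
        }
    }
}
```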

lostmsu commented 3 years ago

@taku-y also, is there a reason to keep reward and not_done on CPU?

taku-y commented 3 years ago

@lostmsu

I find it weird that the buffer needs to know where the model is. This would not work for multiple GPUs: the training loop needs to move data from where it is to where it has to be, not the buffer.

That sounds reasonable. Supporting multiple GPUs is a nice feature, but I don't have any idea how to implement it. I need to have a look at some papers and slides (e.g., paper and slides) and other RL libraries like Ray and PFRL.

On the other hand, I would like agents to be generic, supporting mini-batches that could be, for example, a set of Vecs or some enums. But such agents might be better implemented as separate structs; for the DQN agent, there could be two versions, one that supports a GPU buffer and one that does not.

is there a reason to keep reward and not_done on CPU?

I just wasn't aware of it. These buffers should be on the specified device.

lostmsu commented 3 years ago

@taku-y I spent some time trying to work this up to look pristine, and I have started to believe that making TchBuffer and TchBatch generic is a mistake. All components should just be Tensor objects instead of O::SubBatch, O::Item, or A::*. The training loop should simply convert them all to tensors the instant it receives them from the environment. This would also make most models be of type SubModel<Input = Tensor, Output = Tensor>, potentially removing the need for SubModel to be generic too.

A good argument for this is that TchBuffer already leaks the fact that it heavily relies on tch, because fn batch(...) takes an argument of type &Tensor. It is even named TchBuffer!

Do you have a scenario where this behavior (i.e., converting observations and actions to Tensor immediately) would not work? If not, I'd update this PR to make the modification I am suggesting.
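
A sketch of what that boundary could look like (hypothetical types, assuming a gym-style flat vector observation): the environment output is converted to a Tensor the moment it crosses into the training loop, so the buffer, batches, and models all deal only in Tensor.

```rust
use tch::{Device, Tensor};

// Hypothetical environment output: a flat vector observation.
struct GymObs {
    values: Vec<f32>,
}

// Convert to a Tensor immediately, on the device of the caller's choice.
fn obs_to_tensor(obs: &GymObs, device: Device) -> Tensor {
    Tensor::from_slice(&obs.values).to_device(device)
}
```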

taku-y commented 3 years ago

@lostmsu Environments with Tuple observations in gym wouldn't work with Tensor observations (link). Another example is a picking robot whose observation includes a camera image and the joint angles of the manipulator. I want to apply the agents to such environments.

lostmsu commented 3 years ago

What about Vec<Tensor>?

taku-y commented 3 years ago

@lostmsu A dict-like structure would be better to avoid errors. Another example of such an environment is RLBench, which provides a set of robot benchmarks. Each observation has four camera images and other state information, like joint positions and angles.

https://github.com/stepjam/RLBench/blob/3aa9bb3ad534d8fcdeb93d3f5ff1d161ce5c8fe6/rlbench/gym/rlbench_env.py#L74-L81

I think I should add an example to demonstrate the flexibility of the library.
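
For illustration only, such a dict-like observation could be modeled as named tensors (several camera images plus low-dimensional state such as joint positions), from which each model picks the entries it needs:

```rust
use std::collections::HashMap;
use tch::Tensor;

// Illustrative RLBench-style observation: a dict of named tensors.
struct DictObs {
    entries: HashMap<String, Tensor>,
}

impl DictObs {
    // Look up a single component, e.g. "wrist_rgb" or "joint_positions".
    fn get(&self, key: &str) -> Option<&Tensor> {
        self.entries.get(key)
    }
}
```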