An extension of the PPO implementation that supports multiple epochs and configurable mini-batch sizes, as proposed here: https://arxiv.org/pdf/1707.06347.pdf
The algorithm works as follows: for a given number of iterations I, N actors interact with the environment by executing trajectories of length T and computing advantage estimates. The data collected during this interaction phase is then used to update the policy: K epochs of mini-batch updates are performed over the collected data, using a mini-batch size M <= N x T. A sketch of this update phase is given below.
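The following is a minimal sketch of that update phase, assuming the collected samples have been pooled into flat arrays of length N x T; the callback `update_on_minibatch` stands in for the clipped-surrogate PPO gradient step and, like the function name itself, is an illustrative assumption rather than part of the actual code.

```python
import numpy as np

def ppo_update_phase(data, N, T, K, M, update_on_minibatch):
    """Run K epochs of mini-batch updates of size M over the N * T collected samples.

    `data` is a dict of arrays of length N * T (observations, actions,
    advantages, ...); `update_on_minibatch` performs one gradient step on
    the PPO objective for a given mini-batch (hypothetical helper).
    """
    total = N * T
    assert M <= total, "mini-batch size must not exceed the number of collected samples"
    for epoch in range(K):
        # Reshuffle the pooled samples once per epoch.
        idx = np.random.permutation(total)
        for start in range(0, total, M):
            mb = idx[start:start + M]
            minibatch = {key: values[mb] for key, values in data.items()}
            update_on_minibatch(minibatch)  # one gradient step on the PPO objective
```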
The current implementation assumes that M is a multiple of T, since this is much easier to implement in the current abstraction and makes sense in the episodic setting; mini-batches therefore consist of whole trajectories, and the mini-batch size expressed in trajectories satisfies M <= N.
A new training function trainloop_ppo has been added to the utility class. It requires three additional arguments: the number of actors N, the mini-batch size M <= N (measured in trajectories), and the number of epochs K.
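A hypothetical call might look as follows; the utility class name, the argument names, and the other parameters are assumptions for illustration, not the actual interface.

```python
# Hypothetical usage sketch: only trainloop_ppo and the roles of N, M, K
# come from the description above; everything else is assumed.
util = TrainingUtility(env, agent)   # placeholder for the utility class
util.trainloop_ppo(
    num_iterations=1000,   # I: number of outer iterations
    num_actors=8,          # N: actors collecting trajectories per iteration
    minibatch_size=4,      # M: mini-batch size in trajectories, M <= N
    num_epochs=4,          # K: epochs over the collected data per iteration
)
```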