An extension of the PPO implementation that supports multiple epochs and configurable mini-batch sizes, as proposed here: https://arxiv.org/pdf/1707.06347.pdf
The algorithm works as follows: for a given number of iterations I, N actors interact with the environment by executing trajectories of length T and computing advantage estimates. The data collected during this interaction phase is then used to update the policy: K epochs of mini-batch updates are performed over the collected data, using a mini-batch size M <= N x T. A sketch of this update phase is given below.
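The following is a minimal sketch of that update phase, assuming the collected samples have been pooled into flat arrays of length N x T; the callback `update_on_minibatch` stands in for the clipped-surrogate PPO gradient step and, like the function name itself, is an illustrative assumption rather than part of the actual code.

```python
import numpy as np

def ppo_update_phase(data, N, T, K, M, update_on_minibatch):
    """Run K epochs of mini-batch updates of size M over the N * T collected samples.

    `data` is a dict of arrays of length N * T (observations, actions,
    advantages, ...); `update_on_minibatch` performs one gradient step on
    the PPO objective for a given mini-batch (hypothetical helper).
    """
    total = N * T
    assert M <= total, "mini-batch size must not exceed the number of collected samples"
    for epoch in range(K):
        # Reshuffle the pooled samples once per epoch.
        idx = np.random.permutation(total)
        for start in range(0, total, M):
            mb = idx[start:start + M]
            minibatch = {key: values[mb] for key, values in data.items()}
            update_on_minibatch(minibatch)  # one gradient step on the PPO objective
```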
The current implementation assumes that M is a multiple of T, since this is much easier to implement in the current abstraction and makes sense in the episodic setting; mini-batches therefore consist of whole trajectories, and the mini-batch size expressed in trajectories satisfies M <= N.
A new training function trainloop_ppo has been added to the utility class. It requires three additional arguments: the number of actors N, the mini-batch size M <= N (measured in trajectories), and the number of epochs K.
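A hypothetical call might look as follows; the utility class name, the argument names, and the other parameters are assumptions for illustration, not the actual interface.

```python
# Hypothetical usage sketch: only trainloop_ppo and the roles of N, M, K
# come from the description above; everything else is assumed.
util = TrainingUtility(env, agent)   # placeholder for the utility class
util.trainloop_ppo(
    num_iterations=1000,   # I: number of outer iterations
    num_actors=8,          # N: actors collecting trajectories per iteration
    minibatch_size=4,      # M: mini-batch size in trajectories, M <= N
    num_epochs=4,          # K: epochs over the collected data per iteration
)
```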