As it is now, the PPO algorithm first calls a function that triggers all the workers to step through episodes and produce samples that way. Those samples are then saved in the replay buffer, and over multiple learner iterations random batches are drawn from the replay buffer for updates. These learner iterations together make up one learner step.
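A minimal sketch of that synchronous flow, just to make the structure explicit. The names (`collect_samples`, `policy.update`, `NUM_LEARNER_ITERATIONS`, `BATCH_SIZE`) are placeholders, not the actual identifiers in the code base:

```python
import random

NUM_LEARNER_ITERATIONS = 10  # hypothetical value
BATCH_SIZE = 64              # hypothetical value

def learner_step(workers, replay_buffer, policy):
    # 1) All workers step through episodes and return their samples.
    for worker in workers:
        replay_buffer.extend(worker.collect_samples(policy))

    # 2) Multiple learner iterations, each on a random mini-batch from the
    #    replay buffer; together they make up one learner step.
    for _ in range(NUM_LEARNER_ITERATIONS):
        batch = random.sample(replay_buffer, BATCH_SIZE)
        policy.update(batch)  # PPO loss + gradient step
```

Nothing overlaps here: the learner sits idle while the workers collect, and the workers sit idle while the learner iterates.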
Maybe there is a way to let the workers do their work while the learner iterations are running.
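One way this could look, sketched under the same placeholder names as above: launch the next round of rollouts in a thread pool and run the learner iterations on the previous round's samples while those rollouts are in flight.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUM_LEARNER_ITERATIONS = 10  # hypothetical value
BATCH_SIZE = 64              # hypothetical value

def overlapped_training(workers, replay_buffer, policy, num_steps):
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        # Start the first round of rollouts.
        futures = [pool.submit(w.collect_samples, policy) for w in workers]
        for _ in range(num_steps):
            # Wait for the rollouts that are in flight and store the samples.
            for f in futures:
                replay_buffer.extend(f.result())
            # Immediately launch the next round of rollouts ...
            futures = [pool.submit(w.collect_samples, policy) for w in workers]
            # ... while the learner iterations run on the samples just stored.
            for _ in range(NUM_LEARNER_ITERATIONS):
                batch = random.sample(replay_buffer, BATCH_SIZE)
                policy.update(batch)
```

The trade-off is that the rollouts in flight use a policy that is one learner step stale, which may matter given PPO's on-policy assumptions.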