PPOSGD rollouts currently aren't sliced into individual episodes before being fed into the predictor. Because a single path can contain multiple episodes, our episode calculation is incorrect.
We have a couple of options:
1. Change the rollout behavior of PPO to match what we do in TRPO
2. Correct the episode calculation logic in the predictor to handle paths that have multiple episodes
I currently think that option 1 would be the better choice. It would also be an opportunity to parallelize `traj_segment_generator`, as we do in `parallel_trpo`, which should improve performance.
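
To make the first option more concrete, here is a rough sketch of slicing one rollout segment into per-episode paths. It assumes the segment is a dict of equal-length arrays with a boolean `new` array that is true on the first step of each episode; that layout and the `split_into_episodes` helper are assumptions for illustration, not the current code.

```python
import numpy as np


def split_into_episodes(segment):
    """Split a fixed-length rollout segment into per-episode paths.

    Assumed (hypothetical) layout: `segment` is a dict of arrays that all
    share the same length, and segment["new"][t] is True when step t is
    the first step of an episode.
    """
    new = np.asarray(segment["new"], dtype=bool)
    horizon = len(new)

    # Indices where a new episode begins. Index 0 is dropped because the
    # segment always starts a piece there, whether or not new[0] is set.
    starts = np.flatnonzero(new)
    starts = starts[starts > 0]

    # Only split per-timestep arrays; leave any other entries out.
    keys = [k for k, v in segment.items()
            if isinstance(v, np.ndarray) and len(v) == horizon]
    pieces = {k: np.split(segment[k], starts) for k in keys}

    return [{k: pieces[k][i] for k in keys} for i in range(len(starts) + 1)]
```

Note that the first and last paths may be partial episodes, since a segment can start mid-episode or be cut off by the horizon, so the predictor would still need to decide how to treat those.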
Interested in your thoughts here, @Raelifin