nottombrown / rl-teacher

Code for Deep RL from Human Preferences [Christiano et al]. Plus a webapp for collecting human feedback

Correct reward per episode values from PPOSGD #10

Closed by nottombrown 7 years ago

nottombrown commented 7 years ago

PPOSGD rollouts currently aren't sliced into individual episodes before being fed into the predictor, which makes our reward-per-episode calculation incorrect.


We have a couple options:

  1. Change the rollout behavior of PPO to match what we do in TRPO
  2. Correct the episode calculation logic in the predictor to handle paths that have multiple episodes

I currently think option 1 is the better choice. It would also be an opportunity to parallelize the traj_segment_generator, as we do in parallel_trpo, which should improve performance.

Interested in your thoughts here, @Raelifin

nottombrown commented 7 years ago

Fixed by cutting PPO rollouts into individual episodes before feeding them to the predictor.
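
For reference, here is a minimal sketch (not the repo's actual code) of what slicing a PPOSGD rollout segment into episodes can look like. It assumes the segment is a dict of parallel arrays with a boolean `new` flag marking the first step of each episode, in the style of Baselines' traj_segment_generator; the field names and helper are illustrative.

```python
import numpy as np

def slice_segment_into_episodes(seg):
    """Split one rollout segment into a list of single-episode paths."""
    new = np.asarray(seg["new"])
    # Indices where a fresh episode begins; make sure step 0 counts as a start.
    starts = np.flatnonzero(new)
    if len(starts) == 0 or starts[0] != 0:
        starts = np.insert(starts, 0, 0)
    bounds = list(starts) + [len(new)]

    episodes = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        episodes.append({
            "obs": seg["ob"][lo:hi],
            "actions": seg["ac"][lo:hi],
            "rewards": seg["rew"][lo:hi],
        })
    return episodes

# Example: a 6-step segment containing two episodes (steps 0-3 and 4-5).
seg = {
    "new": np.array([1, 0, 0, 0, 1, 0]),
    "ob": np.arange(6),
    "ac": np.arange(6),
    "rew": np.ones(6),
}
for ep in slice_segment_into_episodes(seg):
    print(len(ep["rewards"]), ep["rewards"].sum())  # per-episode reward is now computed per slice
```

Note that the last slice may be a partial episode when the segment boundary falls mid-episode, so a caller may want to drop or flag it rather than treat it as a complete episode.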