nottombrown / rl-teacher

Code for Deep RL from Human Preferences [Christiano et al.], plus a webapp for collecting human feedback.
MIT License

Fast random rollouts #14

Closed Raelifin closed 6 years ago

Raelifin commented 6 years ago

Much, much faster random rollout collection. It scales with the number of CPU cores and no longer relies on parallel-trpo. The logic is also simpler: the environment loop is visible in the code, and it uses neither TensorFlow nor custom exceptions.

Raelifin commented 6 years ago

Oh, it's also now compatible with Atari and other environments with discrete actions.
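For readers who have not opened the diff, here is a minimal sketch of the general idea described above: each worker process runs a plain environment loop with uniformly random actions, and the work is spread across cores with a process pool. This is not the code from this PR. `ToyEnv`, `collect_rollout`, and `collect_rollouts_parallel` are hypothetical names, and the toy environment stands in for a real Gym environment so the example needs only the standard library.

```python
import random
from multiprocessing import Pool


class ToyEnv:
    """Hypothetical stand-in for a Gym-style environment (not from rl-teacher)."""

    def __init__(self, discrete, seed):
        self.discrete = discrete
        self.n_actions = 4                 # used when actions are discrete
        self.low, self.high = -1.0, 1.0    # used when actions are continuous
        self.rng = random.Random(seed)     # per-rollout RNG keeps workers independent
        self.t = 0

    def sample_action(self):
        # Uniform random action, discrete or continuous.
        if self.discrete:
            return self.rng.randrange(self.n_actions)
        return self.rng.uniform(self.low, self.high)

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        obs = float(self.t)
        reward = -abs(float(action))
        done = self.t >= 20                # toy episodes end after 20 steps
        return obs, reward, done


def collect_rollout(args):
    """Run one random rollout; takes a single tuple so it works with Pool.map."""
    seed, horizon, discrete = args
    env = ToyEnv(discrete, seed)
    obs = env.reset()
    path = {"obs": [], "actions": [], "rewards": []}
    for _ in range(horizon):               # the explicit environment loop
        action = env.sample_action()
        path["obs"].append(obs)
        path["actions"].append(action)
        obs, reward, done = env.step(action)
        path["rewards"].append(reward)
        if done:
            break
    return path


def collect_rollouts_parallel(n_rollouts, horizon, discrete=False, workers=2):
    """Distribute independent rollouts over worker processes."""
    jobs = [(seed, horizon, discrete) for seed in range(n_rollouts)]
    with Pool(workers) as pool:
        return pool.map(collect_rollout, jobs)
```

Calling `collect_rollouts_parallel(n_rollouts=8, horizon=50, workers=4)` from under an `if __name__ == "__main__":` guard returns a list of eight rollout dictionaries. Because the rollouts share no state, adding workers should speed up collection until the machine runs out of cores, though the actual scaling depends on the environment and on process start-up overhead.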

nottombrown commented 6 years ago

This looks like a solid refactor!

Have you tested to make sure that sampling random rollouts from a different policy doesn't degrade performance much? As a sanity check, I'd be interested in seeing before-and-after learning curves on Hopper with 700 labels.

nottombrown commented 6 years ago

Boom. LGTM