nottombrown / rl-teacher

Code for Deep RL from Human Preferences [Christiano et al]. Plus a webapp for collecting human feedback
MIT License

Debug PPO #20

Closed by nottombrown 7 years ago

nottombrown commented 7 years ago

TRPO matches performance from before

python rl_teacher/teach.py -w 4 -a parallel_trpo -p synth -l 700 -e Reacher-v1 -n debug-ppo/trpo-synth-64-700-s1 -V -s 1

image

PPO is also learning well

image

Performance with 700 labels may still need some tuning, though.