mrahtz / learning-from-human-preferences

Reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences"
MIT License

Using Reward Predictor #16

Open eunjuyummy opened 7 months ago

eunjuyummy commented 7 months ago

Hi! I'm following the piece-by-piece runs, and everything works fine up through the pretrain_reward_predictor step. However, the part where I use the generated reward predictor to learn the policy doesn't seem to be working properly.

This is the command I ran:

```
python3 run.py train_policy_with_preferences BreakoutNoFrameskip-v4 --load_reward_predictor_ckpt_dir runs/breakout-initial_predictor_3fca07c/reward_predictor_checkpoints --n_envs 16 --million_timesteps 0.1
```

When I run it, it brings up the preference collection window again.

How should I invoke training so that it learns using the newly created reward predictor?
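For context, the intended setup here is that the policy trains against a *frozen* reward predictor restored from a checkpoint, with no further preference collection. Below is a minimal, generic sketch of that idea; it is not the repo's actual code, and `FrozenRewardPredictor` and `rollout_with_learned_reward` are hypothetical names. The predictor is stubbed out as a fixed linear function so the example runs standalone (in the real project it would be a trained network restored from `reward_predictor_checkpoints`).

```python
import numpy as np


class FrozenRewardPredictor:
    """Stand-in for a reward predictor restored from a checkpoint.

    In the real project this would be a trained network whose weights are
    loaded from disk and then held fixed; here it is a fixed random linear
    function so the sketch is self-contained.
    """

    def __init__(self, obs_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=obs_dim)  # frozen "weights"

    def predict(self, obs):
        # Scalar learned reward for one observation.
        return float(self.w @ obs)


def rollout_with_learned_reward(predictor, obs_dim=4, n_steps=5, seed=1):
    """Collect a short trajectory where the learned reward replaces the
    environment reward. The random observations stand in for env.step()."""
    rng = np.random.default_rng(seed)
    rewards = []
    for _ in range(n_steps):
        obs = rng.normal(size=obs_dim)  # placeholder observation
        rewards.append(predictor.predict(obs))
    return rewards


predictor = FrozenRewardPredictor(obs_dim=4)
learned_rewards = rollout_with_learned_reward(predictor)
print(len(learned_rewards))
```

The key point the sketch illustrates: once the predictor is frozen, the policy-training loop only ever *queries* it, so no preference window should be needed during this phase.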

mrahtz commented 7 months ago

Hmm, it's been such a long time since I wrote this code that I can't remember how it all fits together now, and I don't think I'll have time any time soon to dig into it again. Sorry not to be of more help!