mrahtz / learning-from-human-preferences

Reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences"
MIT License

Using Reward Predictor #16

Open eunjuyummy opened 7 months ago

eunjuyummy commented 7 months ago

Hi! I'm following the piece-by-piece runs, and everything works fine up through the pretrain_reward_predictor step. However, the part where I use the generated reward predictor to learn the policy doesn't seem to be working properly.

This is the command I ran:

```
python3 run.py train_policy_with_preferences BreakoutNoFrameskip-v4 --load_reward_predictor_ckpt_dir runs/breakout-initial_predictor_3fca07c/reward_predictor_checkpoints --n_envs 16 --million_timesteps 0.1
```

When I run it, it brings up the preference collection window again.

How should I invoke training so that it learns using the newly created reward predictor?
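For context, the intended setup here is that the policy trains against a *frozen* reward predictor restored from a checkpoint, with no further preference collection. Below is a minimal, generic sketch of that idea; it is not the repo's actual code, and `FrozenRewardPredictor` and `rollout_with_learned_reward` are hypothetical names. The predictor is stubbed out as a fixed linear function so the example runs standalone (in the real project it would be a trained network restored from `reward_predictor_checkpoints`).

```python
import numpy as np


class FrozenRewardPredictor:
    """Stand-in for a reward predictor restored from a checkpoint.

    In the real project this would be a trained network whose weights are
    loaded from disk and then held fixed; here it is a fixed random linear
    function so the sketch is self-contained.
    """

    def __init__(self, obs_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=obs_dim)  # frozen "weights"

    def predict(self, obs):
        # Scalar learned reward for one observation.
        return float(self.w @ obs)


def rollout_with_learned_reward(predictor, obs_dim=4, n_steps=5, seed=1):
    """Collect a short trajectory where the learned reward replaces the
    environment reward. The random observations stand in for env.step()."""
    rng = np.random.default_rng(seed)
    rewards = []
    for _ in range(n_steps):
        obs = rng.normal(size=obs_dim)  # placeholder observation
        rewards.append(predictor.predict(obs))
    return rewards


predictor = FrozenRewardPredictor(obs_dim=4)
learned_rewards = rollout_with_learned_reward(predictor)
print(len(learned_rewards))
```

The key point the sketch illustrates: once the predictor is frozen, the policy-training loop only ever *queries* it, so no preference window should be needed during this phase.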

mrahtz commented 7 months ago

Hmm, it's been such a long time since I wrote this code that I can't remember how it all fits together now, and I don't think I'll have time any time soon to dig into it again. Sorry not to be of more help!