Hi!
I'm following the piece-by-piece runs, and everything works fine up through pretrain_reward_predictor.
But the step where I use the generated reward predictor to learn the policy doesn't seem to be working properly.
This is the command I ran:
python3 run.py train_policy_with_preferences BreakoutNoFrameskip-v4 --load_reward_predictor_ckpt_dir runs/breakout-initial_predictor_3fca07c/reward_predictor_checkpoints --n_envs 16 --million_timesteps 0.1
When I run it, it brings up the preference collection window again.
How should I invoke it so that it trains the policy using the newly created reward predictor?
Hmm, it's been such a long time since I wrote this code that I can't remember how it all fits together now, and I don't think I'll have time any time soon to dig into it again. Sorry not to be of more help!