mrahtz / learning-from-human-preferences

Reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences"
MIT License

Synthetic preferences - no preferences received #2

Closed · JawwadF closed this 6 years ago

JawwadF commented 6 years ago

I tried to train using synthetic preferences in the MovingDot environment using the command:

python3 run.py train_policy_with_preferences MovingDotNoFrameskip-v0 --synthetic_prefs --ent_coef 0.02 --million_timesteps 0.15

However, it doesn't appear to be learning, and I continuously get the output

Waiting for preferences: 0 so far

Do you know what might be going on here?

(Also great job on this project!)

JawwadF commented 6 years ago

Also, I can confirm that training with the original reward function from the environment works fine.

Command used:

python3 run.py train_policy_with_original_rewards MovingDotNoFrameskip-v0 --million_timesteps 0.25

mrahtz commented 6 years ago

Ah, I'd made a change to easy-tf-log that broke its fork-safety, so it was hanging when trying to flush events. This is fixed now. Either fetch an updated copy of the Pipfile and run pipenv install, or inside the virtualenv run:

pip uninstall easy-tf-log
pip install easy-tf-log==1.1
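For context, a fork-safety bug like this usually means a child process inherits a file handle (and possibly an internal lock) from its parent and then deadlocks or misbehaves when it tries to flush. A minimal sketch of the usual fix pattern, with hypothetical names (this is not the actual easy-tf-log code): record the PID at creation time, and reopen the handle if the logger is used from a different process.

```python
import os

class ForkSafeLogger:
    """Illustrative sketch only (not easy-tf-log's implementation):
    reopen the underlying file handle if the process has forked since
    the logger was created, so a forked child never flushes through a
    handle shared with its parent."""

    def __init__(self, path):
        self.path = path
        self._pid = os.getpid()          # remember which process opened the file
        self._f = open(path, "a")

    def _check_fork(self):
        # If the PID has changed, we're in a forked child; the inherited
        # handle may be in an inconsistent state, so open a fresh one.
        if os.getpid() != self._pid:
            self._pid = os.getpid()
            self._f = open(self.path, "a")

    def log(self, msg):
        self._check_fork()
        self._f.write(msg + "\n")
        self._f.flush()
```

The same idea applies to locks and sockets: anything stateful that a fork duplicates should be revalidated (or recreated) on first use in the child.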

JawwadF commented 6 years ago

It seems to be collecting synthetic preferences now, thank you! I still need to check whether it's actually learning - I'll get back to you to confirm.

Thanks for fixing the issue!

mrahtz commented 6 years ago

Assuming this is fixed.