Reproduction of OpenAI and DeepMind's Deep Reinforcement Learning from Human Preferences, based on the paper at https://arxiv.org/pdf/1706.03741.pdf.
The main milestones of this reproduction were:
Training an agent to move the dot to the middle in a simple environment using synthetic preferences.
Training an agent to play Pong using synthetic preferences.
Training an agent to stay alongside other cars in Enduro using human preferences.
This project uses TensorFlow 1, which needs Python 3.7 or below.
To set up an isolated environment and install dependencies, install Pipenv, then just run:
$ pipenv install
However, note that TensorFlow must be installed manually. Either:
$ pipenv run pip install tensorflow==1.15
or
$ pipenv run pip install tensorflow-gpu==1.15
depending on whether you have a GPU. (If you run into problems, try installing TensorFlow 1.6.0, which was used for development.)
If you want to run tests, also run:
$ pipenv install --dev
Finally, before running any of the scripts, enter the environment with:
$ pipenv shell
All training is done using run.py. Basic usage is:
$ python3 run.py <mode> <environment>
Supported environments are MovingDotNoFrameskip-v0, PongNoFrameskip-v4, and EnduroNoFrameskip-v4.
To train using the original rewards from the environment rather than rewards based on preferences, use the train_policy_with_original_rewards mode.
For example, to train Pong:
$ python3 run.py train_policy_with_original_rewards PongNoFrameskip-v4 --n_envs 16 --million_timesteps 10
To train using preferences, use the train_policy_with_preferences mode.
For example, to train MovingDotNoFrameskip-v0 using synthetic preferences:
$ python3 run.py train_policy_with_preferences MovingDotNoFrameskip-v0 --synthetic_prefs --ent_coef 0.02 --million_timesteps 0.15
On a machine with a GPU, this takes about an hour. TensorBoard logs are created automatically in a new directory in runs/.
To train Pong using synthetic preferences:
$ python3 run.py train_policy_with_preferences PongNoFrameskip-v4 --synthetic_prefs --dropout 0.5 --n_envs 16 --million_timesteps 20
On a 16-core machine without a GPU, this takes about 13 hours.
To train Enduro (a modified version with a time limit so the weather doesn't change, which the paper notes can confuse the reward predictor) using human preferences:
$ python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes
You'll see two windows: a larger one showing a pair of examples of agent behaviour, and another smaller window showing the last full episode that the agent played (so you can see how qualitative behaviour is changing). Enter 'L' in the terminal to indicate that you prefer the left example; 'R' to indicate you prefer the right example; 'E' to indicate you prefer them both equally; and just press enter if the two clips are incomparable.
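The four responses can be read as labels over the pair of clips. The paper encodes a preference as a distribution over the two clips, uses a uniform distribution for "equal", and leaves incomparable pairs out of the preference database. The following is a minimal sketch of that mapping. The function name parse_preference is hypothetical and is not taken from pref_interface.py, and accepting lower-case input is an assumption.

```python
def parse_preference(response):
    """Map a typed response to a preference distribution over (left, right).

    Returns None when the pair is incomparable and should be skipped.
    Raises ValueError for anything other than 'L', 'R', 'E' or an empty line.
    """
    # Assumption: surrounding whitespace and letter case are ignored.
    choice = response.strip().upper()
    if choice == "L":
        return (1.0, 0.0)  # all probability mass on the left clip
    if choice == "R":
        return (0.0, 1.0)  # all probability mass on the right clip
    if choice == "E":
        return (0.5, 0.5)  # both clips equally preferred
    if choice == "":
        return None  # incomparable: do not store this pair
    raise ValueError(f"Unrecognised response: {response!r}")
```

A caller would typically loop until the response parses, and only store the pair when the result is not None.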
On an 8-core machine with a GPU, it takes about 2.5 hours to reproduce the video above: about an hour to collect 500 preferences about behaviour from a random policy, then half an hour to pretrain the reward predictor using those 500 preferences, then an hour to train the policy (while still collecting preferences).
The bottleneck is mainly labelling speed, so if you've already saved human preferences in runs/enduro, you can re-use those preferences by training with:
$ python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes --load_prefs_dir runs/enduro --n_initial_epochs 10
This only takes about half an hour.
You can also run different parts of the training process separately, saving their results for later use:
Use the gather_initial_prefs mode to gather the initial 500 preferences used for pretraining the reward predictor. This saves preferences to train_initial.pkl.gz and val_initial.pkl.gz in the run directory.
Use pretrain_reward_predictor to just pretrain the reward predictor (200 epochs). Specify the run directory to load initial preferences from with --load_prefs_dir.
Use the --load_reward_predictor_ckpt argument when running in train_policy_with_preferences mode to load a pretrained reward predictor.
For example, to gather synthetic preferences for MovingDotNoFrameskip-v0, saving to run directory moving_dot-initial_prefs:
$ python run.py gather_initial_prefs MovingDotNoFrameskip-v0 --synthetic_prefs --run_name moving_dot-initial_prefs
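The .pkl.gz extension suggests gzip-compressed pickle files. The sketch below shows how such a file can be written and read back with the standard library. The helper names save_prefs and load_prefs are hypothetical, and the example data layout (clip identifiers plus a preference distribution) is an assumption for illustration only, not the structure this project actually stores.

```python
import gzip
import os
import pickle
import tempfile


def save_prefs(prefs, path):
    """Pickle `prefs` and write it gzip-compressed to `path`."""
    with gzip.open(path, "wb") as f:
        pickle.dump(prefs, f)


def load_prefs(path):
    """Read back an object written by save_prefs."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical data: (left clip id, right clip id, preference distribution).
example = [(0, 1, (1.0, 0.0)), (2, 3, (0.5, 0.5))]

with tempfile.TemporaryDirectory() as run_dir:
    path = os.path.join(run_dir, "train_initial.pkl.gz")
    save_prefs(example, path)
    restored = load_prefs(path)
```

Unpickling executes code from the file, so only load preference files you created yourself or otherwise trust.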
To run on FloydHub (a cloud platform for running machine learning jobs), use something like:
floyd run --follow --env tensorflow-1.5 --tensorboard 'bash floydhub_utils/floyd_wrapper.sh python run.py --log_dir /output --synthetic_prefs train_policy_with_preferences PongNoFrameskip-v4'
Check out runs reproducing the above results at https://www.floydhub.com/mrahtz/projects/learning-from-human-preferences.
To run a trained policy checkpoint so you can see what the agent was doing, use run_checkpoint.py. Basic usage is:
$ python3 run_checkpoint.py <environment> <policy checkpoint directory>
For example, to run an agent saved in runs/pong:
$ python3 run_checkpoint.py PongNoFrameskip-v4 runs/pong/policy_checkpoints
There are three main components:
The A2C workers (a2c/a2c/a2c.py)
The preference interface (pref_interface.py)
The reward predictor (reward_predictor.py)
The flow of data begins with the A2C workers, which generate video clips of the agent trying things in the environment.
These video clips (referred to in the code as 'segments') are sent to the preference interface. The preference interface shows pairs of video clips to the user and asks through a command-line interface which clip of each pair shows more of the kind of behaviour the user wants.
Preferences are sent to the reward predictor, which trains a deep neural network to predict each preference from the associated pair of video clips. Preferences are predicted based on a comparison between two penultimate scalar values in the network (one for each video clip) representing some measure of how much the user likes each of the two clips in the pair.
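To make the comparison concrete, here is a minimal NumPy sketch of the approach described in the paper: a softmax over the two per-clip scalars gives the predicted preference, and the training loss is the cross-entropy against the label. The function names are hypothetical, and this is not the TensorFlow code in reward_predictor.py. The scores are assumed to be plain floats, one per clip.

```python
import numpy as np


def log_preference_probs(score_left, score_right):
    """Log of the softmax over the two clip scores: (log p_left, log p_right)."""
    scores = np.array([score_left, score_right], dtype=np.float64)
    shifted = scores - scores.max()  # subtract the max for numerical stability
    return shifted - np.log(np.sum(np.exp(shifted)))


def preference_loss(score_left, score_right, pref):
    """Cross-entropy between the label `pref` and the predicted distribution.

    `pref` is a distribution over (left, right), e.g. (1.0, 0.0) or (0.5, 0.5).
    """
    log_probs = log_preference_probs(score_left, score_right)
    return float(-np.sum(np.asarray(pref, dtype=np.float64) * log_probs))
```

The loss shrinks as the score of the preferred clip rises relative to the other clip, which is what pushes the scalar towards "how much the user likes this clip".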
That network can then be used to predict rewards for future video clips by feeding the clip in, running a forward pass to calculate the "how much the user likes this clip" value, then normalising the result to have zero mean and constant variance across time.
This normalised value is then used directly as a reward signal to train the A2C workers according to the preferences given by the user.
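One way to normalise across time is to keep running statistics of the predicted rewards and rescale each new value with them. The sketch below uses Welford's online algorithm for the running mean and variance. The class name, the target_std parameter and its default of 1.0 are assumptions for illustration, not values taken from this project.

```python
import math


class RunningNormaliser:
    """Rescale a stream of values to zero mean and a fixed standard deviation."""

    def __init__(self, target_std=1.0, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.target_std = target_std
        self.eps = eps  # avoids division by zero when all values are equal

    def update(self, value):
        """Fold one new value into the running statistics (Welford's method)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def std(self):
        """Population standard deviation so far (1.0 until two values are seen)."""
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalise(self, value):
        """Return `value` shifted and scaled by the current statistics."""
        return (value - self.mean) / (self.std + self.eps) * self.target_std


norm = RunningNormaliser()
for raw_reward in [1.0, 2.0, 3.0, 4.0, 5.0]:
    norm.update(raw_reward)
```

After the five updates above, the running mean is 3.0, so norm.normalise(3.0) is 0.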
All components run asynchronously in different subprocesses:
There are three tricky parts to this:
All subprocesses are started and coordinated by run.py.
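As a rough picture of components running in separate subprocesses and passing data along, the sketch below connects a stand-in segment producer to a stand-in labeller using multiprocessing queues. It is not how run.py is written. All names are hypothetical, segments are reduced to lists of per-step rewards, and the labeller uses a synthetic preference (the clip with the higher total reward wins, with ties treated as equal, which is an assumption).

```python
import multiprocessing as mp
import random


def synthetic_preference(left, right):
    """Prefer the segment with the higher total reward; ties count as equal."""
    if sum(left) > sum(right):
        return (1.0, 0.0)
    if sum(right) > sum(left):
        return (0.0, 1.0)
    return (0.5, 0.5)


def segment_producer(segment_queue, n_segments, seed=0):
    """Stand-in for an A2C worker: emits segments as lists of per-step rewards."""
    rng = random.Random(seed)
    for _ in range(n_segments):
        segment_queue.put([rng.choice([0.0, 1.0]) for _ in range(5)])


def synthetic_labeller(segment_queue, pref_queue, n_pairs):
    """Stand-in for the preference interface: labels consecutive segment pairs."""
    for _ in range(n_pairs):
        left = segment_queue.get()
        right = segment_queue.get()
        pref_queue.put((left, right, synthetic_preference(left, right)))


def run_pipeline(n_pairs):
    """Run producer and labeller in separate processes; return labelled pairs."""
    segment_queue, pref_queue = mp.Queue(), mp.Queue()
    procs = [
        mp.Process(target=segment_producer, args=(segment_queue, 2 * n_pairs)),
        mp.Process(target=synthetic_labeller,
                   args=(segment_queue, pref_queue, n_pairs)),
    ]
    for proc in procs:
        proc.start()
    # Drain the results before joining so no process blocks on a full pipe.
    prefs = [pref_queue.get(timeout=30) for _ in range(n_pairs)]
    for proc in procs:
        proc.join()
    return prefs
```

Calling run_pipeline(3) from under an `if __name__ == "__main__":` guard returns three labelled pairs. A reward predictor process would sit at the far end of pref_queue in the same way.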
It turned out to be possible to reach the milestones in the results section above even without implementing a number of features described in the original paper.
reward_predictor.py, but we always operate with only a single-member ensemble, and pref_interface.py just chooses segments randomly.)
If you want to hack on this project to learn some deep RL, here are some ideas for extensions and things to investigate:
run_checkpoint.py with a reward predictor checkpoint), it looks like the predicted rewards might be slightly better-shaped than the original rewards, even when trained with synthetic preferences based on the original rewards. Specifically, in Pong, it looks like there might be a small positive reward whenever the agent hits the ball. Could a reward predictor trained from synthetic preferences be used to automatically shape rewards for easier training?
A2C code in a2c is based on the implementation from OpenAI's baselines, commit f8663ea.