mrahtz / learning-from-human-preferences

Reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences"
MIT License

Deep Reinforcement Learning from Human Preferences

Reproduction of OpenAI and DeepMind's Deep Reinforcement Learning from Human Preferences, based on the paper at https://arxiv.org/pdf/1706.03741.pdf.

Results

The main milestones of this reproduction were:

Usage

Python setup

This project uses TensorFlow 1, which requires Python 3.7 or below.

To set up an isolated environment and install dependencies, install Pipenv, then just run:

$ pipenv install

However, note that TensorFlow must be installed manually. Either:

$ pipenv run pip install tensorflow==1.15

or

$ pipenv run pip install tensorflow-gpu==1.15

depending on whether you have a GPU. (If you run into problems, try installing TensorFlow 1.6.0, which was used for development.)

If you want to run tests, also run:

$ pipenv install --dev

Finally, before running any of the scripts, enter the environment with:

$ pipenv shell

Running

All training is done using run.py. Basic usage is:

$ python3 run.py <mode> <environment>

Supported environments are MovingDotNoFrameskip-v0, PongNoFrameskip-v4, and EnduroNoFrameskip-v4.

Training with original rewards

To train using the original rewards from the environment rather than rewards based on preferences, use the train_policy_with_original_rewards mode.

For example, to train Pong:

$ python3 run.py train_policy_with_original_rewards PongNoFrameskip-v4 --n_envs 16 --million_timesteps 10

Training end-to-end with preferences

Use the train_policy_with_preferences mode.

For example, to train MovingDotNoFrameskip-v0 using synthetic preferences:

$ python3 run.py train_policy_with_preferences MovingDotNoFrameskip-v0 --synthetic_prefs --ent_coef 0.02 --million_timesteps 0.15

On a machine with a GPU, this takes about an hour. TensorBoard logs (created in a new directory in runs/ automatically) should look something like:

To train Pong using synthetic preferences:

$ python3 run.py train_policy_with_preferences PongNoFrameskip-v4 --synthetic_prefs --dropout 0.5 --n_envs 16 --million_timesteps 20

On a 16-core machine without GPU, this takes about 13 hours. TensorBoard logs should look something like:

To train Enduro (a modified version with a time limit so the weather doesn't change, which the paper notes can confuse the reward predictor) using human preferences:

$ python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes

You'll see two windows: a larger one showing a pair of examples of agent behaviour, and another smaller window showing the last full episode that the agent played (so you can see how qualitative behaviour is changing). Enter 'L' in the terminal to indicate that you prefer the left example; 'R' to indicate you prefer the right example; 'E' to indicate you prefer them both equally; and just press enter if the two clips are incomparable.
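The key-to-preference mapping described above can be sketched as a small parsing function. This is only an illustration: the name `parse_label`, the `(left, right)` weight encoding, and the case-insensitive matching are assumptions, not taken from the project's code.

```python
from typing import Optional, Tuple

# Hypothetical encoding: a preference is a pair (weight_left, weight_right).
Preference = Tuple[float, float]


def parse_label(raw: str) -> Optional[Preference]:
    """Map a typed response to a preference.

    Returns None when the two clips are incomparable (empty input).
    """
    key = raw.strip().upper()  # case-insensitivity is an assumption
    if key == "L":
        return (1.0, 0.0)  # left clip preferred
    if key == "R":
        return (0.0, 1.0)  # right clip preferred
    if key == "E":
        return (0.5, 0.5)  # both clips equally preferred
    if key == "":
        return None  # incomparable: no preference recorded
    raise ValueError(f"Unrecognised response: {raw!r}")
```

Returning None for incomparable pairs lets the caller skip them instead of storing a label the reward predictor would have to interpret.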

On an 8-core machine with GPU, it takes about 2.5 hours to reproduce the video above: about an hour to collect 500 preferences about behaviour from a random policy, then half an hour to pretrain the reward predictor using those 500 preferences, then an hour to train the policy (while still collecting preferences).

The bottleneck is mainly labelling speed, so if you've already saved human preferences in runs/enduro, you can re-use those preferences by training with:

$ python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes --load_prefs_dir runs/enduro --n_initial_epochs 10

This only takes about half an hour.

Piece-by-piece runs

You can also run different parts of the training process separately, saving their results for later use:

For example, to gather synthetic preferences for MovingDotNoFrameskip-v0, saving to run directory moving_dot-initial_prefs:

$ python run.py gather_initial_prefs MovingDotNoFrameskip-v0 --synthetic_prefs --run_name moving_dot-initial_prefs

Running on FloydHub

To run on FloydHub (a cloud platform for running machine learning jobs), use something like:

floyd run --follow --env tensorflow-1.5 --tensorboard 'bash floydhub_utils/floyd_wrapper.sh python run.py --log_dir /output --synthetic_prefs train_policy_with_preferences PongNoFrameskip-v4'

Check out runs reproducing the above results at https://www.floydhub.com/mrahtz/projects/learning-from-human-preferences.

Running checkpoints

To run a trained policy checkpoint so you can see what the agent was doing, use run_checkpoint.py. Basic usage is:

$ python3 run_checkpoint.py <environment> <policy checkpoint directory>

For example, to run an agent saved in runs/pong:

$ python3 run_checkpoint.py PongNoFrameskip-v4 runs/pong/policy_checkpoints

Architecture notes

There are three main components:

Data flow

The flow of data begins with the A2C workers, which generate video clips of the agent trying things in the environment.

These video clips (referred to in the code as 'segments') are sent to the preference interface. The preference interface shows pairs of video clips to the user and asks through a command-line interface which clip of each pair shows more of the kind of behaviour the user wants.
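As a rough illustration of how a stream of frames might be turned into segments and paired up for comparison, here is a minimal sketch. The function names, the fixed `segment_length`, and the choice to drop an incomplete final clip are all assumptions made for the example. The project's actual segment handling may differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def split_into_segments(
    frames: Sequence[Frame], segment_length: int
) -> List[List[Frame]]:
    """Cut a trajectory into consecutive, non-overlapping clips.

    Frames left over at the end (fewer than segment_length) are dropped.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    n_full = len(frames) // segment_length
    return [
        list(frames[i * segment_length:(i + 1) * segment_length])
        for i in range(n_full)
    ]


def sample_pair(
    segments: Sequence[List[Frame]], rng: random.Random
) -> Tuple[List[Frame], List[Frame]]:
    """Pick two distinct clips to show side by side."""
    first, second = rng.sample(list(segments), 2)
    return first, second
```

For example, `split_into_segments(list(range(7)), 3)` yields two clips of three frames each and discards the seventh frame.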

Preferences are sent to the reward predictor, which trains a deep neural network to predict each preference from the associated pair of video clips. Preferences are predicted based on a comparison between two penultimate scalar values in the network (one for each video clip) representing some measure of how much the user likes each of the two clips in the pair.
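The comparison between the two scalar values can be sketched as follows. The paper describes a softmax over the two clip scores trained with a cross-entropy loss. This standalone sketch uses that formulation, but the function names and the `(left, right)` label encoding are hypothetical, and whether the code in this repository matches it term by term is an assumption.

```python
import math
from typing import Tuple


def preference_probability(score_left: float, score_right: float) -> float:
    """Probability that the left clip is preferred.

    A softmax over two scalars reduces to a logistic function of their
    difference, computed here in a numerically stable way.
    """
    diff = score_left - score_right
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(
    score_left: float, score_right: float, label: Tuple[float, float]
) -> float:
    """Cross-entropy between a label (weight_left, weight_right) and the
    predicted preference distribution."""
    p_left = preference_probability(score_left, score_right)
    p_right = 1.0 - p_left
    eps = 1e-12  # guards against log(0) when a prediction saturates
    return -(
        label[0] * math.log(p_left + eps)
        + label[1] * math.log(p_right + eps)
    )
```

With equal scores the predicted probability is 0.5, and the loss falls as the score of the preferred clip rises relative to the other.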

That network can then be used to predict rewards for future video clips by feeding the clip in, running a forward pass to calculate the "how much the user likes this clip" value, then normalising the result to have zero mean and constant variance across time.
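One way to normalise a stream of predicted rewards "across time" is to keep running statistics and rescale each value as it arrives. The sketch below uses Welford's online algorithm. The class name, the `target_std` parameter, and the use of a running (rather than windowed) estimate are assumptions for illustration, not a description of this project's code.

```python
import math


class RunningNormalizer:
    """Rescale a stream of values to zero mean and a fixed standard deviation."""

    def __init__(self, target_std: float = 1.0, eps: float = 1e-8) -> None:
        self.target_std = target_std  # hypothetical: the constant std to aim for
        self.eps = eps                # avoids division by zero
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0                 # sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new raw reward into the running statistics (Welford)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def normalize(self, value: float) -> float:
        """Return the value shifted and scaled by the statistics seen so far."""
        if self.count < 2:
            return 0.0  # not enough data to estimate a spread yet
        std = math.sqrt(self.m2 / self.count)
        return self.target_std * (value - self.mean) / (std + self.eps)
```

In use, each raw prediction would be passed to `update` and then to `normalize` before being handed to the policy as its reward.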

This normalised value is then used directly as a reward signal to train the A2C workers according to the preferences given by the user.

Processes

All components run asynchronously in different subprocesses:

There are three tricky parts to this:

All subprocesses are started and coordinated by run.py.

Changes to the paper's setup

It turned out to be possible to reach the milestones in the results section above even without implementing a number of features described in the original paper.

Ideas for future work

If you want to hack on this project to learn some deep RL, here are some ideas for extensions and things to investigate:

Code credits

A2C code in a2c is based on the implementation from OpenAI's baselines, commit f8663ea.