Nikolay Savinov¹, Anton Raichuk², Raphaël Marinier², Damien Vincent², Marc Pollefeys¹, Timothy Lillicrap³, Sylvain Gelly²
¹ETH Zurich, ²Google AI, ³DeepMind
Navigation out of curiosity | Locomotion out of curiosity |
---|---|
*(video)* | *(video)* |
This is an implementation of our ICLR 2019 paper *Episodic Curiosity through Reachability*. If you use this work, please cite:
```
@inproceedings{Savinov2019_EC,
  Author = {Savinov, Nikolay and Raichuk, Anton and Marinier, Rapha{\"e}l and Vincent, Damien and Pollefeys, Marc and Lillicrap, Timothy and Gelly, Sylvain},
  Title = {Episodic Curiosity through Reachability},
  Booktitle = {International Conference on Learning Representations ({ICLR})},
  Year = {2019}
}
```
The code was tested on Linux only and assumes that the command `python` invokes Python 2.7. We recommend using virtualenv:
```shell
sudo apt-get install python-pip
pip install virtualenv
python -m virtualenv episodic_curiosity_env
source episodic_curiosity_env/bin/activate
```
Clone this repository:
```shell
git clone https://github.com/google-research/episodic-curiosity.git
cd episodic-curiosity
```
We require a modified version of DeepMind Lab:
Clone DeepMind Lab:
```shell
git clone https://github.com/deepmind/lab
cd lab
```
Apply our patch to DeepMind Lab:
```shell
git checkout 7b851dcbf6171fa184bf8a25bf2c87fe6d3f5380
git checkout -b modified_dmlab
git apply ../third_party/dmlab/dmlab_min_goal_distance.patch
```
Install DMLab as a pip module by following these instructions. In a nutshell, once you've installed DMLab's dependencies, you need to run:
```shell
bazel build -c opt python/pip_package:build_pip_package
./bazel-bin/python/pip_package/build_pip_package /tmp/dmlab_pkg
pip install /tmp/dmlab_pkg/DeepMind_Lab-1.0-py2-none-any.whl --force-reinstall
```
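As a quick sanity check (our suggestion, not part of the upstream instructions), verify that the module imports:

```shell
python -c "import deepmind_lab; print('DMLab OK')"
```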
If you wish to run Mujoco experiments (section S1 of the paper), you need to install dm_control and its dependencies; see this documentation. Then replace `pip install -e .` with `pip install -e .[mujoco]` in the command below.
Finally, install episodic curiosity and its pip dependencies:
```shell
cd episodic-curiosity
pip install -e .
```
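Again as an optional sanity check, the package should now import:

```shell
python -c "import episodic_curiosity; print('episodic_curiosity OK')"
```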
Environment | Training method | Required GPU | Recommended RAM |
---|---|---|---|
DMLab | PPO | No | 32GB |
DMLab | PPO + Grid Oracle | No | 32GB |
DMLab | PPO + EC using already trained R-networks | No | 32GB |
DMLab | PPO + EC with R-network training | Yes, otherwise training is slower by >20x. Required GPU RAM: 5GB | 50GB. Tip: reduce `dataset_buffer_size` to use less RAM, at the expense of policy performance. |
DMLab | PPO + ECO | Yes, otherwise training is slower by >20x. Required GPU RAM: 5GB | 80GB. Tip: reduce `observation_history_size` to use less RAM, at the expense of policy performance. |
Mujoco | PPO + EC using already trained R-networks | No | 32GB |
Trained R-networks and policies can be found in the episodic-curiosity Google Cloud bucket. You can access them via the web interface, or copy them with the `gsutil` command from the Google Cloud SDK:
```shell
gsutil -m cp -r gs://episodic-curiosity/r_networks .
gsutil -m cp -r gs://episodic-curiosity/policies .
```
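To browse what is available before copying, you can also list the bucket (standard `gsutil` usage):

```shell
gsutil ls gs://episodic-curiosity/
```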
Example command to visualize a trained policy over two episodes of 1000 steps each, creating videos similar to the ones at the top of this page:
```shell
python -m episodic_curiosity.visualize_curiosity_reward \
  --workdir=/tmp/ec_visualizations \
  --r_net_weights=<path_to_r_network> \
  --policy_path=<path_to_trained_policy> \
  --alsologtostderr \
  --num_episodes=2 \
  --num_steps=1000 \
  --visualization_type=surrogate_reward \
  --trajectory_mode=do_nothing
```
This requires extra dependencies for generating videos, installed with `pip install -e .[video]`.
`scripts/launcher_script.py` is the main entry point to reproduce the results of Table 1 in the paper. For instance, the following command launches training of the PPO + EC method on the Sparse+Doors scenario:
```shell
python episodic-curiosity/scripts/launcher_script.py --workdir=/tmp/ec_workdir --method=ppo_plus_ec --scenario=sparseplusdoors
```
Main flags:
Flag | Description |
---|---|
`--method` | Solving method to use; corresponds to the rows in Table 1 of the paper. Possible values: `ppo`, `ppo_plus_ec`, `ppo_plus_eco`, `ppo_plus_grid_oracle`. |
`--scenario` | Scenario to launch; corresponds to the columns in Table 1 of the paper. Possible values: `noreward`, `norewardnofire`, `sparse`, `verysparse`, `sparseplusdoors`, `dense1`, `dense2`. `ant_no_reward` is also supported, corresponding to the first row of Table S1. |
`--workdir` | Directory where logs and checkpoints will be stored. |
`--run_number` | Run number of the current run. This is used to create an appropriate subdirectory in `workdir`. |
`--r_networks_path` | Only meaningful for the `ppo_plus_ec` method. Path to the root directory of pre-trained R-networks. If specified, we train the policy using those pre-trained R-networks. If not specified, we first generate the R-network training data, train the R-network, and then train the policy. |
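As a combined example (our own, using only the flags documented above, and assuming `r_networks` is the directory created by the `gsutil cp` command shown earlier):

```shell
python episodic-curiosity/scripts/launcher_script.py \
  --workdir=/tmp/ec_workdir \
  --method=ppo_plus_ec \
  --scenario=verysparse \
  --run_number=1 \
  --r_networks_path=r_networks
```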
Training takes a couple of days. We used CPUs with 16 hyper-threads, but CPUs with fewer threads should also work.
Under the hood, `launcher_script.py` launches `train_policy.py` with the right hyperparameters. For the method `ppo_plus_ec`, it first launches `generate_r_training_data.py` to accumulate training data for the R-network using a random policy, then launches `train_r.py` to train the R-network, and finally `train_policy.py` for the policy. For the method `ppo_plus_eco`, all of this happens online as part of the policy training.
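Conceptually, the `ppo_plus_ec` pipeline thus runs in three stages, sketched below. The invocations are illustrative placeholders (paths and flags elided); see `launcher_script.py` for the exact commands and hyperparameters:

```shell
# Illustrative three-stage pipeline for --method=ppo_plus_ec (flags elided):
python generate_r_training_data.py ...  # 1. collect R-network training data with a random policy
python train_r.py ...                   # 2. train the R-network on that data
python train_policy.py ...              # 3. train the policy with the episodic curiosity reward
```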
First, make sure you have the Google Cloud SDK installed. `scripts/launch_cloud_vms.py` is the main entry point. Edit the script and replace the `FILL-ME`s with the details of your GCP project. In particular, you will need to point it to a GCP disk snapshot with the installed dependencies as described in the Installation section.
IMPORTANT: By default, the script reproduces all results in Table 1 and launches ~300 VMs with GPUs on Cloud (7 scenarios x 4 methods x 10 runs). The cost of running all those VMs is very significant: on the order of USD 30 per day per VM, based on early 2019 GCP pricing. Pass `--i_understand_launching_vms_is_expensive` to `scripts/launch_cloud_vms.py` to indicate that you understand this.
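For instance, once the `FILL-ME`s are filled in, the launch looks like this:

```shell
python scripts/launch_cloud_vms.py --i_understand_launching_vms_is_expensive
```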
Under the hood, `launch_cloud_vms.py` launches one VM for each (scenario, method, run_number) tuple. The VMs use startup scripts to launch training, and retrieve the parameters of the run through Instance Metadata.
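For reference, startup scripts typically read such parameters with a standard Instance Metadata query; the attribute name below is a placeholder, not one defined by this repository:

```shell
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/<attribute_name>"
```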
TIP: Use `sudo journalctl -u google-startup-scripts.service` to see the logs of the startup script.
Each training job stores logs and checkpoints in a workdir. The workdir is organized as follows:
File or Directory | Description |
---|---|
`r_training_data/{R_TRAINING,VALIDATION}/` | TF records with data generated from a random policy for R-network training. Only for method `ppo_plus_ec` without supplying pre-trained R-networks. |
`r_networks/` | Keras checkpoints of trained R-networks. Only for method `ppo_plus_ec` without supplying pre-trained R-networks. |
`reward_{train,valid,test}.csv` | CSV files with {train,valid,test} rewards, tracking the performance of the policy at multiple training steps. |
`checkpoints/` | Checkpoints of the policy. |
`log.txt`, `progress.csv` | Training logs and CSV from OpenAI's PPO2 code. |
On Cloud, the workdir of each job will be synced to a cloud bucket directory of the form `<cloud_bucket_root>/<vm_id>/<method>/<scenario>/run_number_<d>/`.
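To pull the results of a run back to a local machine, standard `gsutil` copying works against that same pattern (the concrete path segments are placeholders):

```shell
gsutil -m cp -r "gs://<cloud_bucket_root>/<vm_id>/<method>/<scenario>/run_number_<d>/" /tmp/ec_results/
```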
We provide a colab to plot graphs during training of the policies, using data from the `reward_{train,valid,test}.csv` files.
Check out the code for Semi-parametric Topological Memory, which uses graph-based episodic memory constructed from a short video to navigate in novel environments (thus providing an exploitation policy, complementary to the exploration policy in this work).
The `ppo_plus_eco` method is not robust to restarts, because the R-network trained online is not checkpointed.