This repository implements Discriminative-model-based Episodic Intrinsic Reward (DEIR), an exploration method for reinforcement learning that has been found to be particularly effective in environments with stochasticity and partial observability. More details can be found in the original paper "DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards" (arXiv preprint, to appear at IJCAI 2023).
Our PPO implementation is based on Stable Baselines 3. If you are mainly interested in the implementation of DEIR itself, its major components can be found in src/algo/intrinsic_rewards/deir.py.
Video demos of DEIR: (1) Introduction & MiniGrid demo, (2) ProcGen demo.
conda create -n deir python=3.9
conda activate deir
git clone https://github.com/swan-utokyo/deir.git
cd deir
python3 -m pip install -r requirements.txt
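To quickly check that the installation succeeded, you can try importing the core dependencies. This one-liner is only a sketch and assumes that PyTorch and Stable Baselines 3 are installed by requirements.txt:
python3 -c "import torch, stable_baselines3; print(torch.__version__, stable_baselines3.__version__)"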
Run the command below from the root directory of this repository to train a DEIR agent in the standard DoorKey-8x8 (MiniGrid) environment.
PYTHONPATH=./ python3 src/train.py \
--int_rew_source=DEIR \
--env_source=minigrid \
--game_name=DoorKey-8x8
To train a DEIR agent in a more challenging DoorKey-8x8 variant, in which the agent's view is restricted to 3x3, walls are not visible, and noise is added to observations, run:
PYTHONPATH=./ python3 src/train.py \
--int_rew_source=DEIR \
--env_source=minigrid \
--game_name=DoorKey-8x8-ViewSize-3x3 \
--can_see_walls=0 \
--image_noise_scale=0.1
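The flags above can be recombined for other MiniGrid tasks in the same way. The command below is a sketch that reuses only the options shown earlier; MultiRoom-N6 is used purely as an illustrative game name, so please check the environment names actually registered by this repository before running it:
PYTHONPATH=./ python3 src/train.py \
--int_rew_source=DEIR \
--env_source=minigrid \
--game_name=MultiRoom-N6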
Sample output:
run: 0 iters: 1 frames: 8192 rew: 0.389687 rollout: 9.281 sec train: 1.919 sec
run: 0 iters: 2 frames: 16384 rew: 0.024355 rollout: 9.426 sec train: 1.803 sec
run: 0 iters: 3 frames: 24576 rew: 0.032737 rollout: 8.622 sec train: 1.766 sec
run: 0 iters: 4 frames: 32768 rew: 0.036805 rollout: 8.309 sec train: 1.776 sec
run: 0 iters: 5 frames: 40960 rew: 0.043546 rollout: 8.370 sec train: 1.768 sec
run: 0 iters: 6 frames: 49152 rew: 0.068045 rollout: 8.337 sec train: 1.772 sec
run: 0 iters: 7 frames: 57344 rew: 0.112299 rollout: 8.441 sec train: 1.754 sec
run: 0 iters: 8 frames: 65536 rew: 0.188911 rollout: 8.328 sec train: 1.732 sec
run: 0 iters: 9 frames: 73728 rew: 0.303772 rollout: 8.354 sec train: 1.741 sec
run: 0 iters: 10 frames: 81920 rew: 0.519742 rollout: 8.239 sec train: 1.749 sec
run: 0 iters: 11 frames: 90112 rew: 0.659334 rollout: 8.324 sec train: 1.777 sec
run: 0 iters: 12 frames: 98304 rew: 0.784067 rollout: 8.869 sec train: 1.833 sec
run: 0 iters: 13 frames: 106496 rew: 0.844819 rollout: 9.068 sec train: 1.740 sec
run: 0 iters: 14 frames: 114688 rew: 0.892450 rollout: 8.077 sec train: 1.745 sec
run: 0 iters: 15 frames: 122880 rew: 0.908270 rollout: 7.873 sec train: 1.738 sec
To train a DEIR agent on ProcGen (here, the ninja game), run:
PYTHONPATH=./ python3 src/train.py \
--int_rew_source=DEIR \
--env_source=procgen --game_name=ninja --total_steps=100_000_000 \
--num_processes=64 --n_steps=256 --batch_size=2048 \
--n_epochs=3 --model_n_epochs=3 \
--learning_rate=1e-4 --model_learning_rate=1e-4 \
--policy_cnn_type=2 --features_dim=256 --latents_dim=256 \
--model_cnn_type=1 --model_features_dim=64 --model_latents_dim=256 \
--policy_cnn_norm=LayerNorm --policy_mlp_norm=NoNorm \
--model_cnn_norm=LayerNorm --model_mlp_norm=NoNorm \
--adv_norm=0 --adv_eps=1e-5 --adv_momentum=0.9
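Under the usual PPO convention that these flag names suggest, each iteration collects 64 × 256 = 16,384 frames (num_processes × n_steps), which are then split into minibatches of 2,048 (8 minibatches per epoch) for 3 policy epochs and 3 discriminative-model epochs.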
Please note that the default value of each option in src/train.py is optimized for DEIR. For now, when training other methods, please use the corresponding hyperparameter values specified in Table A1 of our arXiv preprint, such as --int_rew_coef=3e-2 and --rnd_err_norm=0 in the command below.
PYTHONPATH=./ python3 src/train.py \
--int_rew_source=NovelD \
--env_source=minigrid \
--game_name=DoorKey-8x8 \
--int_rew_coef=3e-2 \
--rnd_err_norm=0
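Other intrinsic reward baselines are selected the same way through --int_rew_source. The value RND below is only an assumption about the accepted option names (the full list is defined in src/train.py); as with NovelD, replace the DEIR-optimized defaults with the Table A1 values for the chosen method:
PYTHONPATH=./ python3 src/train.py \
--int_rew_source=RND \
--env_source=minigrid \
--game_name=DoorKey-8x8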