Sample-Efficient Reinforcement Learning with Bootstrapped Dual Policy Iteration

This repository contains the complete implementation of the Bootstrapped Dual Policy Iteration algorithm we developed over the past year and a half. The repository also contains scripts to re-run and re-plot our experiments.

What is BDPI?

BDPI is a model-free reinforcement-learning algorithm for discrete action spaces and continuous or discrete state spaces. It is written in Python with PyTorch and designed to be used with OpenAI Gym environments. Because this implementation uses feed-forward neural networks, it is tailored to Markov Decision Processes (not POMDPs).
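
To illustrate what "feed-forward" implies here, below is a minimal PyTorch sketch of a feed-forward critic of the kind such an implementation relies on. The architecture, layer sizes, and names are assumptions for illustration, not the exact networks used in this repository. Because the network maps a single state to Q-values with no recurrence or memory, it assumes the Markov property, hence MDPs rather than POMDPs.

```python
# Hypothetical feed-forward critic, for illustration only (not the actual
# network definition used in this repository). It maps one state to one
# Q-value per discrete action, with no recurrence or memory.
import torch
import torch.nn as nn

class FeedForwardCritic(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_actions),  # one Q-value per discrete action
        )

    def forward(self, state):
        return self.net(state)

# Example: Q-values for a batch of one 4-dimensional state (e.g. CartPole).
critic = FeedForwardCritic(state_dim=4, num_actions=2)
q_values = critic(torch.randn(1, 4))  # shape: (1, 2)
```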

How is it different?

Many reinforcement-learning algorithms exist. BDPI differs from them on several key points:

Overall Idea

The poster presents the general architecture of BDPI:

[Poster: Sample-Efficient Reinforcement Learning with Bootstrapped Dual Policy Iteration]

BDPI is best explained by following its learning loop:

Because the actor has a learning rate, it learns the expected greedy policy of the critics, which, being trained off-policy, learn the optimal Q-function. And because the greedy policy of the optimal Q-function is the optimal policy, the actor learns the optimal policy. The way the actor moves towards that optimal policy closely resembles Thompson sampling, which has empirically been shown to be an excellent exploration strategy. This may explain why our results look like "BDPI learns immediately while the other algorithms are flat lines": BDPI balances exploration and exploitation almost perfectly, even though it has no advanced features such as novelty-based exploration, reward shaping, or reward rewriting.
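
To make that actor update concrete, here is a minimal sketch using assumed names and shapes (the actual code in this repository may differ): each critic contributes its greedy action, and the actor moves by its learning rate towards the critics' expected greedy policy, which is what gives its behaviour a Thompson-sampling flavour.

```python
# Sketch of the actor's soft update towards the critics' greedy policies.
# Names, shapes and the learning-rate value are assumptions for illustration.
import numpy as np

def update_actor(actor_probs, critic_q_values, actor_lr=0.05):
    """actor_probs: (num_actions,) current policy at one state.
    critic_q_values: (num_critics, num_actions) Q-values of each critic.
    Returns the updated policy at that state."""
    num_critics, num_actions = critic_q_values.shape

    # Greedy policy of each critic: all probability mass on its argmax action.
    greedy = np.zeros((num_critics, num_actions))
    greedy[np.arange(num_critics), critic_q_values.argmax(axis=1)] = 1.0

    # Expected greedy policy over the bootstrapped critics.
    target = greedy.mean(axis=0)

    # Soft update: the actor slowly tracks the critics' greedy policy.
    return (1.0 - actor_lr) * actor_probs + actor_lr * target

# Example: a uniform 3-action policy pulled towards two critics' greedy actions.
pi = np.ones(3) / 3
q = np.array([[1.0, 0.2, 0.0],
              [0.1, 0.9, 0.3]])
print(update_actor(pi, q))  # slightly favours actions 0 and 1
```

Intuitively, because the update is incremental rather than a hard argmax, the actor keeps probability mass on actions the critics still disagree about, which is where the Thompson-sampling-like exploration comes from.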

Source Code

The following components are available in this repository:

The files are organized as follows:

Dependencies

Reproducing our results requires a computer with the following components: