DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Metadata

Authors: Irina Higgins, Arka Pal, +6 authors Alexander Lerchner
Organization: DeepMind
Conference: ICML 2017
Paper: https://arxiv.org/pdf/1707.08475.pdf

Abstract

This paper focuses on domain adaption issues in RL settings where an agent trained on a particular input distribution with a specified reward structure (source domain) is modified but the reward structure remains largely intact (target domain). The target domain can be unknown.
This paper aims to develop an agent that can learn a robust policy using observations and rewards obtained exclusively within the source domain. Here, a policy is considered as robust if it generalizes with minimal drop in performance to the target domain without extra fine-tuning.
This paper tackles the domain adaption problem by learning a disentangled/factorized representation of the world. Examples of such factors of variation in the world are object properties like color, scale, or position; other examples correspond to general environmental factors, such as geometry and lighting.
The purposed system, DARLA, relies on learning a latent state representation that is shared between the source and target domains, by learning a disentangled representation of the environment’s generative factors. Crucially, DARLA does not require target domain data to form its representations.

Framework

Formalized problem setting

The source domain D{S} ≡ (S{S}, A{S}, T{S}, R_{S}).
The target domain D{T} ≡ (S{T}, A{T}, T{T}, R_{T}).
Where S: State; A: Action; T: Transition function; R: Reward.
The domain adaption scenario: S{S} ≠ S{T}; A{S} =A{T}; T{S} ≈ T{T}; R{S} ≈ R{T}.
For example. Robot arm in simulated environment and real world: S: Raw pixels; A: Robot's action; T: Physics of the world; R: The performance on the task.

DARLA

Three stages pipeline:

Learning to see (the main contribution):
- Use a random policy to interact with environment to collect observations (require sufficient variability of factors and their conjunctions).
- Pretrain a β-VAE (#33) on those observations.
- However, the shortcomings of reconstructing in pixel space are known and have been addressed in reconstruction in feature space given by another neural network. (e.g., GAN or pretrained AlexNet)
- In practice, this paper found that using a denoising autoencoder (DAE) for β-VAE works best.
- In detail, they follow the masking noise of [1] with the aim for the DAE to learn a semantic representation of the input frames.
  
  Problem: The DAE might also suffer from domain adapation problem. If the semantic representation learned by DAE doesn't transfer well from source to target domain, the β-VAE, which depends on DAE, will also suffer.
- After pretraining DAE, train β-VAE for reconstruction in DAE's feature space using L2 distance. DAE remains frozen.
Learning to act: The agent is tasked with learning the source policy via a standard RL algorithms (DQN, A3C and Episodic Control). The parameters of the encoder (which encodes raw pixels to internal state for the decoder to predict actions) of agent will not be updated. They also compared with UNREAL.
Transfer: Since the encoder already learns the disentangled representation of the world of source domain, such a policy would then generalize well to the target domain out-of-the-box. In this stage, we simply evaluate the agent in target domain without retraining.

Reference

[1] Context Encoders: Feature Learning by Inpainting by Pathak et al. CVPR 2016.

howardyclo / papernotes

DARLA: Improving Zero-Shot Transfer in Reinforcement Learning #34

Metadata

Abstract

Framework

Formalized problem setting

DARLA

Reference

Further Readings