What is this
Proposes a way to fit a reward function from human preferences over pairs of trajectory segments, and then performs RL on the learned reward to optimize the policy for maximum cumulative (predicted) reward.
Comparison with previous research. What are the novelties/good points?
Key points
Trajectory segments generally begin from different states
(This is a very important assumption for collecting a diverse set of segments)
The human overseer is given a visualization of two trajectory segments, in the form of short movie clips. In all of the experiments, these clips are between 1 and 2 seconds long
The human then indicates that one segment is preferable,
that the two segments are equally preferable, or
that the segments are incomparable
The probability that the human prefers one segment is modeled as a softmax over the summed predicted rewards of the two segments (see Eq. 1), and the reward predictor is fit by minimizing the cross-entropy between this prediction and the human labels (see the first sketch after this list)
Assume a 10% chance that the human responds uniformly at random; this models human annotation error
Use an ensemble of reward predictors (ensemble of 3)
Compute the standard deviation of the ensemble's predictions for each candidate clip pair and send the highest-disagreement pairs to the human raters (see the second sketch after this list)
The RL part itself is very basic/straightforward: an existing RL algorithm is run on the learned reward in place of the true reward
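A minimal sketch of the Eq. 1 preference model and its cross-entropy training loss, assuming the predicted rewards have already been summed over each segment; the function and argument names (preference_prob, preference_loss, sum_rhat_1, sum_rhat_2) are illustrative, not taken from the paper's code:

```python
import numpy as np

def preference_prob(sum_rhat_1: float, sum_rhat_2: float, eps: float = 0.1) -> float:
    """Predicted probability that the human prefers segment 1 over segment 2.

    A softmax over the summed predicted rewards of the two segments (Eq. 1),
    mixed with a uniform response to reflect the assumed 10% chance (eps)
    that the human answers at random.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(sum_rhat_1, sum_rhat_2)
    e1, e2 = np.exp(sum_rhat_1 - m), np.exp(sum_rhat_2 - m)
    softmax = e1 / (e1 + e2)
    return (1.0 - eps) * softmax + eps * 0.5

def preference_loss(sum_rhat_1: float, sum_rhat_2: float, label: float) -> float:
    """Cross-entropy between the predicted preference and the human label.

    label: 1.0 if the human preferred segment 1, 0.0 if segment 2,
           0.5 if the two were judged equally preferable
           (incomparable pairs are simply left out of the training data).
    """
    p1 = preference_prob(sum_rhat_1, sum_rhat_2)
    return -(label * np.log(p1) + (1.0 - label) * np.log(1.0 - p1))
```

In practice this loss would be averaged over the whole database of human comparisons and minimized by gradient descent on the reward predictor's parameters.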
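And a sketch of the disagreement-based query selection; select_queries and ensemble_predict are hypothetical names, the latter standing in for the ensemble of 3 reward predictors:

```python
import numpy as np

def select_queries(candidate_pairs, ensemble_predict, n_queries):
    """Send the clip pairs the ensemble disagrees on most to the human raters.

    candidate_pairs: list of (segment_1, segment_2) tuples.
    ensemble_predict: callable returning, for one pair, the predicted
        P[segment 1 preferred] from each of the 3 ensemble members.
    n_queries: number of pairs to forward to the human raters.
    """
    # Standard deviation across ensemble members, one value per candidate pair.
    stds = [np.std(ensemble_predict(s1, s2)) for s1, s2 in candidate_pairs]
    # Highest-disagreement pairs are queried first.
    order = np.argsort(stds)[::-1]
    return [candidate_pairs[i] for i in order[:n_queries]]
```

The spread across the ensemble members' predicted preference probabilities serves here as a cheap proxy for how uncertain, and therefore how informative, a comparison is likely to be.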
How did the authors prove the effectiveness of the proposal?
Experiments using OpenAI Gym (Simulated Robotics and Atari)
About 700 queries to the human raters for the simulated robotics tasks
Summary
Link
Deep reinforcement learning from human preferences (Christiano et al., 2017, https://arxiv.org/abs/1706.03741)
Author/Institution
OpenAI/DeepMind
Any discussions?
What should I read next?