What is this
Proposes a way to fit a reward function from human preferences over pairs of trajectory segments, and then performs RL on the learned reward to optimize the policy for maximum cumulative (predicted) reward.
Comparison with previous research. What are the novelties/good points?
Key points
Trajectory segments generally begin from different states
(This is a very important assumption for collecting a diverse set of segments)
The human overseer is given a visualization of two trajectory segments, in the form of short movie clips. In all of the experiments, these clips are between 1 and 2 seconds long
The human then indicates that one segment is preferable,
that the two segments are equally preferable, or
that the segments are incomparable
The probability that the human prefers one segment is modeled as a softmax over the summed predicted rewards of the two segments (see Eq. 1), and the reward predictor is fit by minimizing the cross-entropy between this prediction and the human labels (see the first sketch after this list)
Assume a 10% chance that the human responds uniformly at random; this models human annotation error
Use an ensemble of reward predictors (ensemble of 3)
Compute the standard deviation of the ensemble's predictions for each candidate clip pair and send the highest-disagreement pairs to the human raters (see the second sketch after this list)
The RL part itself is very basic/straightforward: an existing RL algorithm is run on the learned reward in place of the true reward
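A minimal sketch of the Eq. 1 preference model and its cross-entropy training loss, assuming the predicted rewards have already been summed over each segment; the function and argument names (preference_prob, preference_loss, sum_rhat_1, sum_rhat_2) are illustrative, not taken from the paper's code:

```python
import numpy as np

def preference_prob(sum_rhat_1: float, sum_rhat_2: float, eps: float = 0.1) -> float:
    """Predicted probability that the human prefers segment 1 over segment 2.

    A softmax over the summed predicted rewards of the two segments (Eq. 1),
    mixed with a uniform response to reflect the assumed 10% chance (eps)
    that the human answers at random.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(sum_rhat_1, sum_rhat_2)
    e1, e2 = np.exp(sum_rhat_1 - m), np.exp(sum_rhat_2 - m)
    softmax = e1 / (e1 + e2)
    return (1.0 - eps) * softmax + eps * 0.5

def preference_loss(sum_rhat_1: float, sum_rhat_2: float, label: float) -> float:
    """Cross-entropy between the predicted preference and the human label.

    label: 1.0 if the human preferred segment 1, 0.0 if segment 2,
           0.5 if the two were judged equally preferable
           (incomparable pairs are simply left out of the training data).
    """
    p1 = preference_prob(sum_rhat_1, sum_rhat_2)
    return -(label * np.log(p1) + (1.0 - label) * np.log(1.0 - p1))
```

In practice this loss would be averaged over the whole database of human comparisons and minimized by gradient descent on the reward predictor's parameters.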
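And a sketch of the disagreement-based query selection; select_queries and ensemble_predict are hypothetical names, the latter standing in for the ensemble of 3 reward predictors:

```python
import numpy as np

def select_queries(candidate_pairs, ensemble_predict, n_queries):
    """Send the clip pairs the ensemble disagrees on most to the human raters.

    candidate_pairs: list of (segment_1, segment_2) tuples.
    ensemble_predict: callable returning, for one pair, the predicted
        P[segment 1 preferred] from each of the 3 ensemble members.
    n_queries: number of pairs to forward to the human raters.
    """
    # Standard deviation across ensemble members, one value per candidate pair.
    stds = [np.std(ensemble_predict(s1, s2)) for s1, s2 in candidate_pairs]
    # Highest-disagreement pairs are queried first.
    order = np.argsort(stds)[::-1]
    return [candidate_pairs[i] for i in order[:n_queries]]
```

The spread across the ensemble members' predicted preference probabilities serves here as a cheap proxy for how uncertain, and therefore how informative, a comparison is likely to be.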
How did the authors prove the effectiveness of the proposal?
Experiments using OpenAI Gym (Simulated Robotics and Atari)
About 700 queries to the human raters for the simulated robotics tasks
Summary
Link
Deep reinforcement learning from human preferences (Christiano et al., 2017, https://arxiv.org/abs/1706.03741)
Author/Institution
OpenAI/DeepMind
Any discussions?
What should I read next?