nottombrown / rl-teacher

Code for Deep RL from Human Preferences [Christiano et al.], plus a webapp for collecting human feedback.
MIT License

Fix rollouts #21

Closed by nottombrown 6 years ago

nottombrown commented 6 years ago

Rollouts ran on multiple workers, and every worker used the same default seed for both the environment and the pseudorandom `random` module. As a result, the workers produced essentially identical segments, so pretraining was comparing segments that were basically the same.
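
A minimal sketch of the idea behind the fix, using only Python's standard library. The names `worker_seed`, `sample_actions`, and `BASE_SEED` are hypothetical and not taken from this repo, and the random draws only stand in for a real rollout; an actual worker would presumably also seed its environment with the derived seed.

```python
import random


def worker_seed(base_seed, worker_id):
    """Hypothetical helper: derive a distinct seed for each worker."""
    return base_seed + worker_id


def sample_actions(seed, n_steps=5):
    """Stand-in for a rollout: draw n_steps pseudorandom actions in [-3, 3]."""
    rng = random.Random(seed)  # private generator, so no state is shared
    return [rng.uniform(-3.0, 3.0) for _ in range(n_steps)]


BASE_SEED = 0  # illustrative value, not the repo's default

# Before: every worker uses the same default seed, so all segments match.
same = [sample_actions(BASE_SEED) for _ in range(3)]

# After: each worker offsets the seed by its id, so segments differ.
distinct = [sample_actions(worker_seed(BASE_SEED, w)) for w in range(3)]

print(same[0] == same[1])          # identical segments under a shared seed
print(distinct[0] == distinct[1])  # different segments under per-worker seeds
```

Offsetting by the worker id is the simplest way to keep runs reproducible while still giving each worker its own stream; any scheme that maps workers to distinct seeds would serve the same purpose.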

## Before

```
array([[-1.53344645],
       [-2.0461825 ],
       [-2.33774915],
       [ 0.93797754],
       [-2.17090229],
       [-1.82050583],
       [-0.78764898],
       [ 1.92595938],
       [-2.41739235],
       [ 2.02766944],
       [-2.42340955],
       [ 2.85875679],
       [-0.18809279],
       [ 2.86056653],
       [ 0.62907312]])
ipdb> pretrain_segments[2]['actions']
array([[-1.53344645],
       [-2.0461825 ],
       [-2.33774915],
       [ 0.93797754],
       [-2.17090229],
       [-1.82050583],
       [-0.78764898],
       [ 1.92595938],
       [-2.41739235],
       [ 2.02766944],
       [-2.42340955],
       [ 2.85875679],
       [-0.18809279],
       [ 2.86056653],
       [ 0.62907312]])
```

## After

```
ipdb> pretrain_segments[2]['actions']
array([[-1.53344645],
       [-2.0461825 ],
       [-2.33774915],
       [ 0.93797754],
       [-2.17090229],
       [-1.82050583],
       [-0.78764898],
       [ 1.92595938],
       [-2.41739235],
       [ 2.02766944],
       [-2.42340955],
       [ 2.85875679],
       [-0.18809279],
       [ 2.86056653],
       [ 0.62907312]])
ipdb> pretrain_segments[12]['actions']
array([[ 1.62348449],
       [-2.11832013],
       [-2.5228675 ],
       [-2.46238179],
       [ 1.03228684],
       [-1.52779674],
       [-0.4767632 ],
       [ 0.34421275],
       [ 2.16330704],
       [ 1.36226558],
       [-1.37803257],
       [-2.2111032 ],
       [-2.66775408],
       [-1.19040819],
       [-1.4272911 ]])
```

nottombrown commented 6 years ago

Improved performance on reacher: [image]