[WiP] Reproducible on- and off-policy sampling

rlworkgroup / garage

A toolkit for reproducible reinforcement learning research.

MIT License


[WiP] Reproducible on- and off-policy sampling #2185

Open MkuuWaUjinga opened 3 years ago

MkuuWaUjinga commented 3 years ago

Extend the Environment API to support setting seeds specific to each environment library.

Tasks:

[x] Extend Environment interface
[x] Set seeds for Gym envs
[x] Ensure seeds are set when Sampler classes start working
[x] Set seeds for dm_control envs
[ ] Set seeds for grid_world envs
[ ] Set seeds for point envs
[ ] Set seeds for metaworld envs
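
For illustration, the first tasks above could look roughly like the sketch below. It is not garage's actual Environment API: `SeedableEnvironment`, `NoisyWalkEnv`, and `rollout` are hypothetical names, and the toy environment stands in for a real Gym or dm_control wrapper. The point is only that the environment owns its random generator and exposes one `seed()` entry point that a sampler can call before it starts collecting.

```python
import abc
import random


class SeedableEnvironment(abc.ABC):
    """Hypothetical interface; a stand-in for an Environment with seeding."""

    @abc.abstractmethod
    def seed(self, seed):
        """Reset the environment's private random generator."""

    @abc.abstractmethod
    def reset(self):
        """Start a new episode and return the first observation."""

    @abc.abstractmethod
    def step(self, action):
        """Apply an action and return the next observation."""


class NoisyWalkEnv(SeedableEnvironment):
    """Toy 1-D environment with a random start state and noisy dynamics."""

    def __init__(self):
        self._rng = random.Random()
        self._pos = 0.0

    def seed(self, seed):
        # All randomness flows through this generator, so seeding it
        # makes reset() and step() reproducible.
        self._rng = random.Random(seed)

    def reset(self):
        self._pos = self._rng.uniform(-1.0, 1.0)
        return self._pos

    def step(self, action):
        self._pos += action + self._rng.gauss(0.0, 0.1)
        return self._pos


def rollout(env, seed, actions):
    """Seed the environment first, then collect one trajectory."""
    env.seed(seed)
    observations = [env.reset()]
    for action in actions:
        observations.append(env.step(action))
    return observations
```

With this shape, two rollouts that use the same seed and the same actions return identical observations, which is the property the seeding tasks are meant to guarantee.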

Open Questions:

How to ensure determinism of off-policy algorithms?
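
One ingredient that probably matters for this question is the replay buffer: even with seeded environments, off-policy updates depend on which transitions get sampled. The sketch below is an assumption about one way to handle that, not garage's replay buffer. `SeededReplayBuffer` is a hypothetical class that keeps its own seeded generator, so batch sampling does not depend on the global random state.

```python
import random


class SeededReplayBuffer:
    """Hypothetical ring buffer that samples with a private, seeded RNG."""

    def __init__(self, capacity, seed):
        self._capacity = capacity
        self._storage = []
        self._next = 0  # index of the slot that gets overwritten next
        self._rng = random.Random(seed)

    def add(self, transition):
        if len(self._storage) < self._capacity:
            self._storage.append(transition)
        else:
            self._storage[self._next] = transition
        self._next = (self._next + 1) % self._capacity

    def sample(self, batch_size):
        # Sampling with replacement from the stored transitions.
        return [self._rng.choice(self._storage) for _ in range(batch_size)]
```

Two buffers built with the same seed and filled with the same transitions in the same order return the same batches. This does not cover other sources of nondeterminism, such as asynchronous workers or GPU kernels.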

MkuuWaUjinga commented 3 years ago

Thanks for the pointers. I've addressed everything in the latest commits. Am I right that GridWorld and PointEnv don't use any seeds at all? Also, with the current implementation every worker gets the same environment seed, so each worker always samples the same trajectory for a fixed action sequence. I think we need to fix this before merging.
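
To make the shared-seed problem concrete, here is a minimal sketch of one possible fix: offsetting the base seed by the worker index. `worker_seed` and `sample_trajectory` are hypothetical helpers, not code from this PR, and the toy dynamics stand in for a real environment.

```python
import random


def worker_seed(base_seed, worker_id):
    """Derive a distinct, reproducible seed for each worker (assumed scheme)."""
    return base_seed + worker_id


def sample_trajectory(seed, actions):
    """Toy rollout whose start state and noise depend only on the seed."""
    rng = random.Random(seed)
    state = rng.uniform(-1.0, 1.0)
    trajectory = [state]
    for action in actions:
        state += action + rng.gauss(0.0, 0.1)
        trajectory.append(state)
    return trajectory


actions = [0.1, -0.2, 0.3]

# Current behaviour: every worker uses the same seed.
shared = [sample_trajectory(42, actions) for _ in range(3)]

# Possible fix: each worker offsets the base seed by its index.
distinct = [sample_trajectory(worker_seed(42, w), actions) for w in range(3)]
```

In `shared` all three workers produce the same trajectory, while in `distinct` they differ but are still reproducible from the single base seed.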