hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[feature request] Add maximum time steps parameter to evaluation function to protect against infinite episodes #876

Open philwinder opened 4 years ago

philwinder commented 4 years ago

Hi there,

When doing hyperparameter tuning with rl-zoo, I often accidentally test a parameter value that produces invalid or explosive policies for a particular algorithm/environment.

Occasionally this produces a policy where the agent does nothing, and in certain environments doing nothing is a perfectly viable action. Therefore, during the evaluation callback, which calls https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/evaluation.py#L6, the agent can just sit there, doing nothing, because the policy is set to deterministic mode.

In this situation there is no code to prevent the episode from continuing forever. It will get stuck in the while loop: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/evaluation.py#L37

Can I make the proposal to add a max_timesteps safety valve? Or maybe num_timesteps to be consistent with other classes?
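
To make the proposal concrete, here is a rough sketch of what such a cap could look like. This is not the library's actual code: `evaluate_with_cap`, `IdleEnv`, and `IdlePolicy` are hypothetical names made up for illustration, and the toy environment assumes the old Gym-style `step` that returns `(obs, reward, done, info)`. The only library behavior assumed is that `model.predict` returns an `(action, state)` pair.

```python
class IdleEnv:
    """Toy environment (hypothetical) whose episode never ends on its own."""

    def reset(self):
        return 0

    def step(self, action):
        # Old Gym-style 4-tuple: observation, reward, done, info
        return 0, 0.0, False, {}


class IdlePolicy:
    """Stand-in for a trained model that always picks the same no-op action."""

    def predict(self, obs, deterministic=True):
        # Mirrors the (action, state) return shape of model.predict
        return 0, None


def evaluate_with_cap(model, env, n_eval_episodes=2, max_timesteps=None):
    """Run evaluation episodes, cutting each one off after max_timesteps steps.

    max_timesteps is the proposed (hypothetical) parameter; None keeps the
    current unbounded behavior.
    """
    episode_rewards, episode_lengths = [], []
    for _ in range(n_eval_episodes):
        obs = env.reset()
        done, total_reward, length = False, 0.0, 0
        while not done:
            action, _state = model.predict(obs, deterministic=True)
            obs, reward, done, _info = env.step(action)
            total_reward += reward
            length += 1
            # Safety valve: stop even if the environment never reports done
            if max_timesteps is not None and length >= max_timesteps:
                break
        episode_rewards.append(total_reward)
        episode_lengths.append(length)
    return episode_rewards, episode_lengths


rewards, lengths = evaluate_with_cap(
    IdlePolicy(), IdleEnv(), n_eval_episodes=3, max_timesteps=50
)
```

With the cap set, every episode in this toy setup stops after 50 steps instead of looping indefinitely. Defaulting to `None` would keep existing callers unaffected.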

One workaround is to add an environment wrapper that forces the episode to end when a maximum number of time steps is reached. I would understand if this is recommended instead of writing more code.
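
For reference, the wrapper workaround could look roughly like the sketch below. `MaxStepsWrapper` and `NeverDoneEnv` are hypothetical names used only for illustration, and the code again assumes the old Gym-style 4-tuple `step`. In practice, Gym's own `gym.wrappers.TimeLimit(env, max_episode_steps=...)` serves the same purpose, so a hand-written wrapper is probably unnecessary.

```python
class NeverDoneEnv:
    """Toy environment (hypothetical) that never signals the end of an episode."""

    def reset(self):
        return 0

    def step(self, action):
        # Old Gym-style 4-tuple: observation, reward, done, info
        return 0, 1.0, False, {}


class MaxStepsWrapper:
    """Hypothetical wrapper that forces done=True after max_episode_steps steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._elapsed = 0

    def reset(self):
        # Restart the step counter at the beginning of every episode
        self._elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed += 1
        if self._elapsed >= self.max_episode_steps:
            # Copy info so the wrapped environment's dict is not mutated,
            # and record whether the episode was cut short by the limit
            info = dict(info)
            info["TimeLimit.truncated"] = not done
            done = True
        return obs, reward, done, info


env = MaxStepsWrapper(NeverDoneEnv(), max_episode_steps=10)
obs = env.reset()
steps, done = 0, False
while not done:
    obs, reward, done, info = env.step(0)
    steps += 1
```

Because the limit lives in the environment, the unmodified evaluation loop terminates on its own, at the cost of every user having to remember to wrap their environment.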

Thanks, Phil

araffin commented 4 years ago

Hello,

In fact I encountered the same issue with Atari games...

> Can I make the proposal to add a max_timesteps safety valve?

Sounds like a simple and acceptable solution ;) We would appreciate a PR for that ;) (btw, once it is merged, you could do almost the same PR to SB3, so we keep both repos in sync).