[feature request] Add maximum time steps parameter to evaluation function to protect against infinite episodes

Hi there,

When doing hyperparameter tuning with rl-zoo, I often accidentally test a parameter value that produces an invalid or diverging policy for a particular algorithm/environment.

Occasionally this produces a policy where the agent does nothing, and in certain environments doing nothing is a perfectly valid action. During the evaluation callback, which calls https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/evaluation.py#L6, the agent can then just sit there indefinitely: because the policy is evaluated in deterministic mode, it picks the same no-op action at every step.

In this situation there is no code to prevent the episode from continuing forever. It gets stuck in the while loop: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/evaluation.py#L37

Could I propose adding a max_timesteps safety valve to the evaluation function? Or maybe call it num_timesteps, to be consistent with other classes?
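To illustrate what I mean, here is a rough sketch of an evaluation loop with such a cap. It is not the library's code: `evaluate_with_step_cap`, `NeverEndingEnv` and `IdleModel` are names I made up for this example, and the loop is a simplified stand-in for the real one that only assumes an env with `reset()`/`step()` and a model with `predict()`.

```python
def evaluate_with_step_cap(model, env, n_eval_episodes=10,
                           deterministic=True, max_timesteps=None):
    """Hypothetical, simplified evaluation loop with a per-episode step cap.

    max_timesteps is the proposed parameter; None keeps the uncapped behaviour.
    """
    episode_rewards, episode_lengths = [], []
    for _ in range(n_eval_episodes):
        obs = env.reset()
        done, state = False, None
        episode_reward, episode_length = 0.0, 0
        while not done:
            action, state = model.predict(obs, state=state,
                                          deterministic=deterministic)
            obs, reward, done, _info = env.step(action)
            episode_reward += reward
            episode_length += 1
            # Safety valve: stop the episode once the cap is reached.
            if max_timesteps is not None and episode_length >= max_timesteps:
                break
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
    return episode_rewards, episode_lengths


class NeverEndingEnv:
    """Toy environment (made up for this example) that never returns done=True."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 0.0, False, {}


class IdleModel:
    """Toy model (made up for this example) that always picks action 0."""

    def predict(self, obs, state=None, deterministic=True):
        return 0, state


rewards, lengths = evaluate_with_step_cap(
    IdleModel(), NeverEndingEnv(), n_eval_episodes=3, max_timesteps=50
)
print(lengths)
```

Without `max_timesteps` this toy setup would loop forever, which is exactly the situation I keep hitting. One open design question is whether a capped episode should be reported differently from one that ended naturally.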

One workaround is to add an environment wrapper that forces the episode to end when a maximum number of time steps is reached. I would understand if that is the recommended approach rather than adding more code to the library.
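For reference, the wrapper workaround could look roughly like the sketch below. `StepLimit`, `NeverEndingEnv` and the `step_limit_reached` info key are names I invented for this example; the class is a plain Python stand-in that only assumes the usual `reset()`/`step()` interface. As far as I know, gym also ships a `TimeLimit` wrapper (`gym.wrappers.TimeLimit`) that serves the same purpose.

```python
class StepLimit:
    """Hypothetical wrapper that ends an episode after max_episode_steps steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._elapsed_steps = 0

    def reset(self):
        # Restart the step counter at the beginning of every episode.
        self._elapsed_steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if not done and self._elapsed_steps >= self.max_episode_steps:
            # Force termination and record that the cap caused it.
            info = dict(info)
            info["step_limit_reached"] = True
            done = True
        return obs, reward, done, info


class NeverEndingEnv:
    """Toy environment (made up for this example) that never returns done=True."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 0.0, False, {}


env = StepLimit(NeverEndingEnv(), max_episode_steps=20)
obs = env.reset()
done, steps, info = False, 0, {}
while not done:
    obs, reward, done, info = env.step(0)
    steps += 1
print(steps, info)
```

With the wrapper in place the existing evaluation loop terminates unchanged, at the cost of having to remember to wrap every environment used for evaluation.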

Thanks, Phil

hill-a/stable-baselines, issue #876