p-christ / Deep-Reinforcement-Learning-Algorithms-with-PyTorch

PyTorch implementations of deep reinforcement learning algorithms and environments

Overlap between first evaluation episode and "min_steps_before_learning" in SAC #59

Open pvdsp opened 4 years ago

pvdsp commented 4 years ago

This might be a minor issue, but I think there is a conflict between the first global episode being treated as an evaluation episode and the min_steps_before_learning hyperparameter (in SAC). While the global step count is smaller than min_steps_before_learning, the agent should take random actions to improve exploration at the start of training. However, because the first episode is an evaluation episode, the first batch of global steps is spent evaluating the model instead of exploring. As a result, the actual number of initial random exploration steps is reduced to min_steps_before_learning minus the number of steps taken in the first evaluation episode.
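To make the accounting concrete, here is a minimal, hypothetical sketch of the step counting. The episode length and threshold values are made up for illustration, and none of this is the repo's actual code:

```python
# Hypothetical numbers illustrating how the first (evaluation) episode
# eats into the random-exploration budget.
min_steps_before_learning = 1000
eval_episode_length = 200          # steps consumed by the first episode

global_step = 0
random_exploration_steps = 0

# Episode 0 is treated as an evaluation episode: the greedy policy acts,
# but the global step counter still advances.
for _ in range(eval_episode_length):
    global_step += 1               # counts toward min_steps_before_learning

# Subsequent training episodes: random actions only while the global
# step count is below the threshold.
while global_step < min_steps_before_learning:
    random_exploration_steps += 1
    global_step += 1

# Only 800 of the intended 1000 initial steps were random exploration.
print(random_exploration_steps)    # -> 800 (= 1000 - 200)
```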

This issue can be solved in two ways: we can keep the first episode as an evaluation episode and delay the start of the random steps to the beginning of the next episode, or we can use the first min_steps_before_learning steps for exploration and delay the evaluation episode until the first episode that starts after min_steps_before_learning steps have elapsed. I would suggest the former. If you agree that this overlap is an issue, I can open a pull request that separates exploration from evaluation, so that we can guarantee the first min_steps_before_learning training steps are random.
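For reference, here is a sketch of the former option under the same made-up numbers: the evaluation episode still runs first, but the random-step budget is measured from the end of that episode, so the full min_steps_before_learning exploration steps actually happen:

```python
# Same illustrative numbers as above; not the repo's actual code.
min_steps_before_learning = 1000
eval_episode_length = 200

global_step = 0
for _ in range(eval_episode_length):   # eval episode, greedy actions
    global_step += 1

exploration_start = global_step        # budget starts *after* evaluation
random_exploration_steps = 0
while global_step - exploration_start < min_steps_before_learning:
    random_exploration_steps += 1
    global_step += 1

print(random_exploration_steps)        # -> 1000, the full budget
```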