[Bug]: WandB last episode log in SB3 algorithms

Bug 🐛

When training with SB3 and the WandB wrapper, the summary metrics end up recording an almost empty last episode. This happens because SB3 algorithms may run some extra timesteps, depending on the algorithm. The model's learn method only accepts a number of timesteps, not a number of episodes, so the overshoot cannot be avoided.
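
As a rough illustration of where the extra timesteps can come from, the arithmetic below assumes an on-policy algorithm such as PPO, which collects experience in whole rollouts of `n_steps` per environment, so the timesteps actually run are rounded up to the next full rollout. The function name, the episode length, and the rollout size are assumptions chosen for illustration, not values taken from this issue.

```python
import math


def timesteps_actually_run(total_timesteps: int, n_steps: int, n_envs: int = 1) -> int:
    """Timesteps executed if data is collected in whole rollouts (assumed behaviour)."""
    rollout_size = n_steps * n_envs
    return math.ceil(total_timesteps / rollout_size) * rollout_size


# Hypothetical setup: 100 episodes of 35040 timesteps each.
episodes = 100
timesteps_per_episode = 35040
requested = episodes * timesteps_per_episode

executed = timesteps_actually_run(requested, n_steps=2048)
extra = executed - requested  # timesteps that spill into episode 101

print(f"requested={requested} executed={executed} extra={extra}")
# requested=3504000 executed=3504128 extra=128
```

Under these assumptions, episode 101 would start and be cut off after only 128 timesteps, which is the kind of nearly empty episode that ends up in the summary.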

This worsens the experience with the platform: the summary tables it generates can only show the last, maximum, or minimum value of each metric, and the last value is useless in this case.

This only happens with WandBLogger; CSVLogger does not have this problem at the moment.

Expected behavior

Episodes beyond the number specified should not be recorded in WandB. For example, if 100 episodes are specified, those 100 should be logged and episode 101 ignored.
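
A minimal sketch of that behavior, assuming episodes are numbered from 1 and the episode limit is known to the logger. `EpisodeLimitFilter` and its method are hypothetical names for illustration only; this is not the WandBLogger API, and a real fix would guard whatever call sends the episode summary to WandB.

```python
class EpisodeLimitFilter:
    """Hypothetical helper that drops episode summaries past a fixed limit."""

    def __init__(self, max_episodes: int) -> None:
        self.max_episodes = max_episodes
        self.logged: list[tuple[int, dict]] = []

    def log_episode(self, episode: int, summary: dict) -> bool:
        # Episodes are numbered from 1; anything past the limit is ignored.
        if episode > self.max_episodes:
            return False
        # Stand-in for the real upload of the summary to WandB.
        self.logged.append((episode, summary))
        return True


# 100 episodes requested, but training spills into a truncated episode 101.
episode_filter = EpisodeLimitFilter(max_episodes=100)
results = [
    episode_filter.log_episode(ep, {"episode_num": ep})
    for ep in range(1, 102)
]

last_logged_episode = episode_filter.logged[-1][0]
print(len(episode_filter.logged), last_logged_episode)
```

With this guard, the last value in the summary would come from episode 100 rather than from the truncated episode 101.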


ugr-sail / sinergym

[Bug]: WandB last episode log in SB3 algorithms #431
