hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[enhancement] Consistent frequency and bookkeeping variables among callbacks #62

Closed brendenpetersen closed 4 years ago

brendenpetersen commented 6 years ago

The callbacks feature is one of the major reasons to use stable-baselines. However, as it stands, it is difficult to create an algorithm-agnostic callback function (one that behaves the same across all algorithms). A simple use case would be a callback that performs custom evaluation rollouts every 1000 completed training episodes. I'm working on a generic Callback class that I can use for all algorithms, similar to how stable-baselines makes it easy to create a launcher that works the same for all algorithms given the algorithm name/kwargs. (I can provide an example of this, if helpful.)

There are currently two issues that make an algorithm-agnostic callback function difficult: 1) callback frequency/timing differs among algorithms, and 2) bookkeeping variables are named inconsistently among algorithms.

  1. Callbacks have different frequency/timing. For example, DDPG calls back after every step of every rollout, whereas PPO1 calls back only once per rollout. More subtly, DDPG calls back after taking the actual step, whereas PPO1 calls back before collecting the rollout. In the use case described above, this can be handled by tracking how many episodes have completed so far and evaluating each time that count passes another 1000-episode mark (see the sketch after this list). Finding this information isn't obvious (see issue 2 below), but when using the Monitor wrapper it can be obtained from len(env.episode_rewards).

  2. It's difficult to access common bookkeeping information (e.g. the policy, env, or number of completed episodes) in a way that works for all algorithms. The callback function, say callback(_locals, _globals), can be used to access most information, but the names are not consistent among algorithms. For example, most algorithms track the total number of completed episodes, but under different names, e.g. _locals['episodes'] for DDPG or _locals['episodes_so_far'] for PPO1. More important is accessing and stepping the policy itself: for DDPG the policy is stepped with _locals['self'].policy_tf.step, whereas for PPO1 it is _locals['self'].policy_pi.step (or _locals['self'].step, which is set to the same function).
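
Here is a minimal sketch of the "evaluate every 1000 episodes" use case, assuming the training env is wrapped with stable-baselines' Monitor so len(monitor_env.episode_rewards) gives the completed-episode count. PPO2, CartPole-v1, and the print-based evaluation loop are arbitrary placeholder choices, not part of any proposed API:

```python
import gym
from stable_baselines import PPO2
from stable_baselines.bench import Monitor

# Training env wrapped with Monitor so completed episodes can be counted
# in an algorithm-agnostic way via len(monitor_env.episode_rewards).
monitor_env = Monitor(gym.make("CartPole-v1"), filename=None)
eval_env = gym.make("CartPole-v1")

def make_episode_callback(monitor, every_n_episodes=1000):
    state = {"last_eval": 0}

    def callback(_locals, _globals):
        model = _locals["self"]
        episodes_done = len(monitor.episode_rewards)
        # Callback frequency differs per algorithm (per step for DDPG, per
        # rollout for PPO1), so trigger whenever the episode count has moved
        # past another `every_n_episodes` boundary instead of testing equality.
        if episodes_done - state["last_eval"] >= every_n_episodes:
            state["last_eval"] = episodes_done
            # Quick evaluation rollout with the current policy.
            obs, done, ep_reward = eval_env.reset(), False, 0.0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, done, _ = eval_env.step(action)
                ep_reward += reward
            print("episodes: %d  eval reward: %.1f" % (episodes_done, ep_reward))
        return True  # keep training

    return callback

model = PPO2("MlpPolicy", monitor_env, verbose=0)
model.learn(total_timesteps=200000, callback=make_episode_callback(monitor_env))
```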

Most of these issues have straightforward fixes. For example, accessing the policy can be made consistent by either defining a step function for the algorithm (e.g. self.step = self.policy_pi.step in PPO1) or making the policy object variable names consistent. Things like counting the number of episodes could be made part of BaseRLModel in case the user isn't using the Monitor wrapper.
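
Until the names are unified in the library, a callback can paper over the differences itself. A rough workaround sketch, assuming only the two key names mentioned above (DDPG's 'episodes' and PPO1's 'episodes_so_far'); other algorithms may expose neither, in which case the Monitor fallback applies:

```python
def completed_episodes(_locals, monitor=None):
    """Best-effort, algorithm-agnostic episode counter for use inside a callback."""
    # Known per-algorithm names for the episode counter (DDPG vs. PPO1).
    for key in ("episodes", "episodes_so_far"):
        if key in _locals:
            return _locals[key]
    # Fall back to a Monitor-wrapped training env, if one was provided.
    if monitor is not None:
        return len(monitor.episode_rewards)
    return None  # no episode information available for this algorithm
```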

araffin commented 6 years ago

Hello, I totally agree with you about the callbacks, and as you mentioned, it is not trivial to give them a consistent step. Regarding consistency in the variable names, feel free to submit a PR, that's something I would like to have in stable-baselines too ;) (quite busy right now, so help from the community is welcome =))

rusu24edward commented 4 years ago

I recently had the same issue. I believe that _locals['self'].num_timesteps works for all algorithms. Using this, I've written some generic callbacks that work approximately the same for all algorithms. For example, saving every 1000 timesteps with PPO2 won't be exact most of the time because it processes 1024 steps before each callback, but it's close enough for me.
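
For reference, a minimal sketch of that approach: an approximate "save every N timesteps" callback built on _locals['self'].num_timesteps and the model's save() method. The checkpoint path is just an example:

```python
def make_checkpoint_callback(save_freq=1000, save_path="./checkpoints/model"):
    state = {"last_save": 0}

    def callback(_locals, _globals):
        model = _locals["self"]
        # num_timesteps can jump by a whole rollout between calls (e.g. PPO2),
        # so check how far we are past the last checkpoint rather than equality.
        if model.num_timesteps - state["last_save"] >= save_freq:
            state["last_save"] = model.num_timesteps
            model.save("%s_%d_steps" % (save_path, model.num_timesteps))
        return True  # keep training

    return callback

# e.g. model.learn(total_timesteps=100000, callback=make_checkpoint_callback())
```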