hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Episode rewards not updated before being used by callback.on_step() #1046

Open calerc opened 3 years ago

calerc commented 3 years ago

This applies to DDPG and TD3, and possibly to other models. The following libraries were installed in a virtual environment:

numpy==1.16.4 stable-baselines==2.10.0 gym==0.14.0 tensorflow==1.14.0

Episode rewards do not seem to be updated in model.learn() before callback.on_step(). Depending on which callback.locals variable is used, this means that either the episode reward is not available to the callback until the next episode has already begun (DDPG), or the reward list seen by the callback is missing the last reward of the episode (TD3).

Also, the callback.locals episode reward variables differ between DDPG and TD3, so a callback meant to work with both models has to account for the differences in variable names and types.

The following code reproduces the error for DDPG and TD3:

from gym import spaces, Env
from stable_baselines import DDPG, TD3
from stable_baselines.common.callbacks import BaseCallback
import numpy as np

NUM_STEPS = 5
MODELS = [DDPG, TD3]

'''
    Callback()
    A simple callback function that prints the episode number and reward
'''
class Callback(BaseCallback):

    def __init__(self, model):
        super(Callback, self).__init__()
        self.count = 0
        self.model = model

    def _on_step(self) -> bool:

        if self.training_env.done:
            self.count += 1
            if type(self.model) is DDPG:

                # 1) We should be able to use episode_reward instead of epoch_episode_rewards,
                #    but neither is updated until after the callback. This means that the episode
                #    reward is not available until the next episode has begun.
                # 3) "episode_reward", a scalar that could be used for DDPG, is different from
                #    "episode_rewards", a list that could be used for TD3. Callbacks designed for
                #    both DDPG and TD3 have to handle the discrepancy in variable types and names.
                if len(self.locals['epoch_episode_rewards']) > 0:
                    reward = self.locals['epoch_episode_rewards'][-1]
                    print('Episode: ' + str(self.count) + ' | Reward: ' + str(reward))
                else:
                    print('-------- Episode 1 is missing because epoch_episode_rewards has not been updated yet --------')

            if type(self.model) is TD3:
                # 2) episode_rewards is not updated to include the last reward from an episode BEFORE being
                #       used by the callback
                reward = self.locals['episode_rewards'][-1]
                print('Episode: ' + str(self.count) + ' | Reward: ' + str(reward))

        return True

'''
    TestEnv()
    A simple environment that ignores the effects of actions
    Episodes always last for NUM_STEPS steps
    For the last step, a reward of +1 is given, regardless of the action
    For every other step, a reward of +0.1 is given, regardless of the action
    For NUM_STEPS = 5, the reward for each episode should be 4 * 0.1 + 1 * 1 = 1.4
'''
class TestEnv(Env):

    def __init__(self):        
        self.action_space = spaces.Box(np.asarray([0]), np.asarray([1]), dtype=np.float32)
        self.observation_space = spaces.Box(np.asarray([0]), np.asarray([1]), dtype=np.float32)
        self.reset()

    def step(self, action):
        self.count += 1
        obs = np.asarray([1])

        reward = 0.1      
        self.done = False
        if self.count == NUM_STEPS:
            reward = 1
            self.done = True

        info = {'is_success': False}

        return obs, reward, self.done, info

    def reset(self):
        self.count = 0
        self.done = False
        # Return an initial observation, as required by the gym API
        return np.asarray([0], dtype=np.float32)

'''
    Construct a DDPG and a TD3 model and demonstrate the bugs in the model.learn() functions.
    In both cases, episode rewards are not updated before being passed to the callbacks.
    The bug is present in stable-baselines 2.10.0.
    DDPG and TD3 may not be the only classes affected.
'''
if __name__ == '__main__':

    env = TestEnv()

    for m in MODELS:

        callback = Callback(model=m)
        model = m('MlpPolicy', env, random_exploration=0)
        print('--------------------------------------------------')
        print(str(m))
        print('Each reward should be 1.4, and there should be 20 episodes printed')
        model.learn(100, callback=callback)
        print('--------------------------------------------------')
Miffyli commented 3 years ago

This should be fixed in 2.10.1, so try installing stable-baselines==2.10.1 (see #787 and the changelog) and see if that works.

calerc commented 3 years ago

Installing stable-baselines==2.10.1 did not work. Looking at TD3.learn() in version 2.10.1:

Since callback.on_step() has access to the correct reward for the step, but not the correct reward for the episode, the problem could be worked around by having the callback keep track of the episode rewards itself (see the sketch below). However, calling callback.on_step() after episode_rewards[-1] += reward_ (or the equivalent for other models) seems like a more robust fix.
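A minimal sketch of that workaround (not from the original report), assuming the per-step reward and done flag are exposed in callback.locals under the keys 'reward' and 'done' (the exact names vary by algorithm and stable-baselines version, so check self.locals.keys() first):

from stable_baselines.common.callbacks import BaseCallback

class EpisodeRewardCallback(BaseCallback):
    '''Accumulates the episode reward itself instead of relying on model.learn() locals'''

    def __init__(self):
        super(EpisodeRewardCallback, self).__init__()
        self.episode_reward = 0.0
        self.episode_count = 0

    def _on_step(self) -> bool:
        # Assumed keys; inspect self.locals.keys() for the algorithm you actually use
        reward = self.locals.get('reward')
        done = self.locals.get('done')
        if reward is not None:
            self.episode_reward += float(reward)
        if done:
            self.episode_count += 1
            print('Episode: ' + str(self.episode_count) + ' | Reward: ' + str(self.episode_reward))
            self.episode_reward = 0.0
        return True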

araffin commented 3 years ago

Hello,

If you want a robust way to retrieve the episode reward, you should use a Monitor wrapper together with a callback. This is what we do in Stable-Baselines3.
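A sketch of that Monitor-plus-callback pattern (names, paths, and the check interval are illustrative, and TestEnv refers to the reproduction script above):

import os

from stable_baselines import TD3
from stable_baselines.bench import Monitor
from stable_baselines.common.callbacks import BaseCallback
from stable_baselines.results_plotter import load_results, ts2xy

class MonitorRewardCallback(BaseCallback):
    '''Reads completed-episode rewards from the Monitor log every check_freq steps'''

    def __init__(self, log_dir, check_freq=100):
        super(MonitorRewardCallback, self).__init__()
        self.log_dir = log_dir
        self.check_freq = check_freq

    def _on_step(self) -> bool:
        if self.n_calls % self.check_freq == 0:
            # x: timesteps, y: rewards of the episodes completed so far, written by Monitor
            x, y = ts2xy(load_results(self.log_dir), 'timesteps')
            if len(y) > 0:
                print('Last completed episode reward: ' + str(y[-1]))
        return True

log_dir = './monitor_logs/'  # illustrative path
os.makedirs(log_dir, exist_ok=True)
env = Monitor(TestEnv(), log_dir)
model = TD3('MlpPolicy', env)
model.learn(1000, callback=MonitorRewardCallback(log_dir))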

In fact, depending on what you really want to do, you could possibly only use a gym.Wrapper.
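A minimal gym.Wrapper sketch along those lines (not from the thread), which sums rewards itself and reports the total when the episode ends:

import gym

class EpisodeRewardWrapper(gym.Wrapper):
    '''Tracks the cumulative reward of the current episode'''

    def __init__(self, env):
        super(EpisodeRewardWrapper, self).__init__(env)
        self.episode_reward = 0.0

    def reset(self, **kwargs):
        self.episode_reward = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.episode_reward += reward
        if done:
            # Expose the total via info so a callback (or anything else) can read it
            info['episode_reward'] = self.episode_reward
            print('Episode reward: ' + str(self.episode_reward))
        return obs, reward, done, info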