hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] What is the proper way to log metrics at the end of each epoch when epochs are variable in length? #1139

Open DavidBellamy opened 2 years ago

DavidBellamy commented 2 years ago

Problem description

I am training a PPO model for stock trading using a custom gym environment called StockTradingEnv. Each "epoch" of training is variable in length, since an epoch ends under one of two conditions: 1) the agent loses all of its initial money, or 2) the agent reaches the end of the data frame/time series (without having lost all of its money). I would like to log the net change in the agent's balance at the end of each of these epochs.

To do so, I maintain an array within the environment, StockTradingEnv.list_networth, containing the agent's net worth at each time step, and reset it (i.e. empty the array) at the start of each new epoch. I attempted to create a subclass of BaseCallback, called TensorboardCallback, with a very simple _on_step() method: it checks StockTradingEnv.done and, if True, logs the net_change for that epoch (the difference between the values at the last and first indexes of StockTradingEnv.list_networth). However, it appears that PPO only invokes its callbacks every n_steps, and n_steps=1 is not permitted as per the documentation:

:param n_steps: The number of steps to run for each environment per update (i.e. rollout buffer size is n_steps * n_envs where n_envs is number of environment copies running in parallel) NOTE: n_steps * n_envs must be greater than 1 (because of the advantage normalization)

Even with n_steps=2, it is possible that an epoch ends on, say, step 1001 (not divisible by 2) and thus no net_change will be logged for that epoch.

What is the proper solution using stable-baselines3 to log metrics from the environment systematically at the end of each epoch, when the epoch lengths are not a constant number of steps?

Code

For the sake of brevity, I did not include the code for the custom environment here. I can always add this if someone deems it necessary.

import pandas as pd
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import BaseCallback

from env import StockTradingEnv  # a custom gym environment for stock trading

# A custom callback 
class TensorboardCallback(BaseCallback):
    """ Logs the net change in cash between the beginning and end of each epoch/run. """

    def __init__(self, verbose=0):
        super(TensorboardCallback, self).__init__(verbose)

    def _on_step(self) -> bool:
        # self.training_env is only attached once learn() starts, so look up
        # the wrapped environment here rather than in __init__
        env = self.training_env.envs[0]
        if env.done:
            net_change = env.list_networth[-1] - env.list_networth[0]
            self.logger.record("net_change", net_change)

        return True

# Load training data
WMT_Train = pd.read_csv("WMT_Train.csv")

# Instantiate the custom environment
env = DummyVecEnv([lambda: StockTradingEnv(WMT_Train, start=0, end=10000, look_back=10)])

# Instantiate model
model = PPO('MlpPolicy', env, learning_rate=0.0001, verbose=0, ent_coef=0.5, 
            tensorboard_log="./ppo_log", n_steps=128)

# Fit model using the custom callback
model.learn(total_timesteps=500000, tb_log_name="PPO_log", callback=TensorboardCallback())

System Info

Miffyli commented 2 years ago

You should probably open this issue on the stable-baselines3 repository :).

But to answer your question: if I understand correctly, you want to log stats after each time PPO is updated. In that case you should use _on_rollout_start and _on_rollout_end (the former is called when collection of new samples begins, the latter when sampling is done; training also happens once per rollout start/end).

DavidBellamy commented 2 years ago

Thanks for such a quick reply! I tested out your suggestions, but they aren't working for me. I tried modifying the custom callback class to use _on_rollout_end instead:

class TensorboardCallback(BaseCallback):
    """ Logs the net change in cash between the beginning and end of each epoch/run. """

    def __init__(self, verbose=0):
        super(TensorboardCallback, self).__init__(verbose)

    # use the _on_rollout_end method for logging end-of-epoch metrics, rather than _on_step
    def _on_rollout_end(self) -> None:
        net_change = self.training_env.envs[0].list_networth[-1] - self.training_env.envs[0].list_networth[0]
        self.logger.record("net_change", net_change)

    def _on_step(self) -> bool:
        return True

But this still has the same issue as when I used _on_step() – the PPO model only invokes TensorboardCallback._on_rollout_end every n_steps. So if I unconditionally log metrics in _on_rollout_end, as in the above code snippet, logging happens too often (every n_steps rather than once per epoch). But if I try to add a conditional branch to the logging procedure, such as:

def _on_rollout_end(self) -> None:
    if self.training_env.envs[0].done:  # only log metric if the epoch is 'done'
        net_change = self.training_env.envs[0].list_networth[-1] - self.training_env.envs[0].list_networth[0]
        self.logger.record("net_change", net_change)

Then I run into the same issue as before – because _on_rollout_end is only invoked every n_steps, if the epoch happens to finish on a step number that is not divisible by n_steps (e.g. n_steps=2 and the epoch finishes on step 1001), then the end-of-epoch metric does not get logged.

Is there a robust way to trigger callbacks at the end of each epoch when the epochs do not have a known length?

Miffyli commented 2 years ago

Is there a robust way to trigger callbacks at the end of each epoch when the epochs do not have a known length?

Hmm, I am a bit confused about the concept of an epoch here. It sounds like what you mean is an episode (from reset to done=True in an environment)? If that is the case, a simple Monitor wrapper (see the examples on how to add this) would do the trick: it saves data on each individual episode into a csv file that you can load up later. At least, this is what I understood from your description (sorry for not suggesting this earlier, I was under the impression you might have tried this).

DavidBellamy commented 2 years ago

That sounds like exactly what I am looking for! Sorry for the confusion; my background is in supervised learning, where there is no concept of an "episode", only batches and epochs. Would you mind including a link to the Monitor wrapper examples you referred to?

Miffyli commented 2 years ago

See how the Monitor wrapper is used here: https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#using-callback-monitoring-training

import os
import gym
from stable_baselines3.common.monitor import Monitor

# Create and wrap the environment; Monitor writes per-episode stats to log_dir
log_dir = "/tmp/gym/"
os.makedirs(log_dir, exist_ok=True)
env = gym.make('LunarLanderContinuous-v2')
env = Monitor(env, log_dir)

After this, info about each episode will be written inside log_dir.
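
If you also want your own per-episode metric (like net_change) in the same records, Monitor takes an info_keywords argument that copies the named keys out of the info dict returned on the episode's final step, and load_results reads the resulting csv back into a DataFrame. Below is a minimal sketch, assuming StockTradingEnv is modified so that step() puts net_change into info when done becomes True; that info key, the environment change, and the log directory path are illustrative, not something already in your code:

import os

import pandas as pd
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.results_plotter import load_results
from stable_baselines3.common.vec_env import DummyVecEnv

from env import StockTradingEnv  # the custom environment from above

log_dir = "./ppo_monitor_log/"  # illustrative path for the Monitor csv
os.makedirs(log_dir, exist_ok=True)

WMT_Train = pd.read_csv("WMT_Train.csv")

def make_env():
    # Assumes StockTradingEnv.step() returns
    # info["net_change"] = list_networth[-1] - list_networth[0] when done is True
    env = StockTradingEnv(WMT_Train, start=0, end=10000, look_back=10)
    return Monitor(env, log_dir, info_keywords=("net_change",))

env = DummyVecEnv([make_env])

model = PPO('MlpPolicy', env, learning_rate=0.0001, ent_coef=0.5, n_steps=128,
            tensorboard_log="./ppo_log")
model.learn(total_timesteps=500000)

# One row per finished episode: reward "r", length "l", wall-clock time "t",
# plus the extra "net_change" column requested via info_keywords
episode_df = load_results(log_dir)
print(episode_df[["r", "l", "net_change"]].tail())

This way every episode end is captured regardless of how many steps it took, since Monitor records the data at the moment done=True rather than on the rollout schedule.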