**Open** · DavidBellamy opened 2 years ago
You should probably open this issue on the stable-baselines3 repository :).

But to answer your question: if I understand correctly, you want to log stats after each PPO update. In that case you should use `_on_rollout_start` and `_on_rollout_end` (the former is called when new sample collection begins, the latter when sampling is done; training also happens once per rollout).
Thanks for such a quick reply! I tested out your suggestions, but they aren't working for me. I tried modifying the custom callback class to use `_on_rollout_end` instead:
```python
class TensorboardCallback(BaseCallback):
    """Logs the net change in cash between the beginning and end of each epoch/run."""

    def __init__(self, verbose=0):
        super(TensorboardCallback, self).__init__(verbose)

    # use _on_rollout_end for logging end-of-epoch metrics, rather than _on_step
    def _on_rollout_end(self) -> None:
        net_change = (
            self.training_env.envs[0].list_networth[-1]
            - self.training_env.envs[0].list_networth[0]
        )
        self.logger.record("net_change", net_change)

    def _on_step(self) -> bool:
        return True
```
But this still has the same issue as when I used `_on_step()`: the PPO model only invokes `TensorboardCallback._on_rollout_end` every `n_steps`. So if I unconditionally log metrics in `_on_rollout_end`, as in the snippet above, logging happens too often (every `n_steps` rather than once per epoch). But if I add a conditional branch to the logging procedure, such as:
```python
def _on_rollout_end(self) -> None:
    if self.training_env.envs[0].done:  # only log the metric if the epoch is 'done'
        net_change = (
            self.training_env.envs[0].list_networth[-1]
            - self.training_env.envs[0].list_networth[0]
        )
        self.logger.record("net_change", net_change)
```
then I run into the other issue as before: because `_on_rollout_end` is only invoked every `n_steps`, if the epoch happens to finish on a step number that is not divisible by `n_steps` (e.g. `n_steps=2` and the epoch finishes on step 1001), then the end-of-epoch metric never gets logged.
Is there a robust way to trigger callbacks at the end of each epoch when the epochs do not have a known length?
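(For reference, one way to sidestep the `n_steps` granularity issue is to do the done-check on every environment step rather than once per rollout. Below is a minimal, library-free sketch of that per-step logic; the names `list_networth` and the `done` flag come from the custom env described in this thread, but how a callback actually receives the done flag at each step, e.g. via `self.locals`, depends on your stable-baselines3 version and is an assumption here.)

```python
def end_of_episode_metric(networths, done):
    """Return the net change for the episode if it just finished, else None.

    `networths` is the per-step net-worth history (like
    StockTradingEnv.list_networth); `done` is the env's terminal flag
    for the current step. Calling this once per step logs exactly one
    value per episode, regardless of n_steps.
    """
    if done and networths:
        return networths[-1] - networths[0]
    return None

# Per-step usage: log only on the terminal step.
history = [1000.0, 1010.0, 980.0]
assert end_of_episode_metric(history, done=False) is None
print(end_of_episode_metric(history, done=True))  # -20.0
```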
> Is there a robust way to trigger callbacks at the end of each epoch when the epochs do not have a known length?
Hmm, I am a bit confused about the concept of an epoch here. It sounds like what you mean is an episode (from `reset` to `done=True` in an environment)? If that is the case, a simple `Monitor` wrapper (see the examples on how to add it) would do the trick: it saves data on each individual episode into a CSV file you can load up later. At least, this is what I understood from your description (sorry for not suggesting this earlier; I was under the impression you might have tried it already).
That sounds like exactly what I am looking for! Sorry to cause confusion, my background is supervised learning where there is no concept of "episode", only batches and epochs. Would you mind including a link to the Monitor wrapper examples you made reference to?
See how the `Monitor` wrapper is used here: https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#using-callback-monitoring-training
```python
import gym

from stable_baselines3.common.monitor import Monitor

# Create and wrap the environment
env = gym.make('LunarLanderContinuous-v2')
env = Monitor(env, log_dir)
```

After this, info about each episode will be written inside `log_dir`.
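(For reference, the file `Monitor` writes is a plain CSV with a one-line JSON metadata header prefixed by `#`, followed by `r` (episode reward), `l` (episode length), and `t` (elapsed time) columns, so it is easy to load back even without SB3 helpers. Here is a stdlib-only parsing sketch; the sample data is fabricated for illustration, and if memory serves stable-baselines3 also ships a pandas-based `load_results` helper that does this for you.)

```python
import csv
import io
import json

# Example contents of a Monitor file (e.g. log_dir/monitor.csv);
# the values here are made up for illustration.
raw = """#{"t_start": 1650000000.0, "env_id": "LunarLanderContinuous-v2"}
r,l,t
-110.5,212,4.2
23.7,305,8.9
"""

def load_monitor_csv(text):
    """Parse a Monitor-style CSV: JSON metadata line, then one row per episode."""
    lines = text.splitlines()
    header = json.loads(lines[0][1:])  # strip the leading '#'
    rows = list(csv.DictReader(io.StringIO("\n".join(lines[1:]))))
    return header, rows

header, episodes = load_monitor_csv(raw)
print(header["env_id"])         # LunarLanderContinuous-v2
print(len(episodes))            # 2
print(float(episodes[0]["r"]))  # -110.5
```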
Problem description

I am training a `PPO` model for stock trading using a custom `gym` environment, called `StockTradingEnv`. Each "epoch" of training is variable in length, since the epoch ends under two conditions: 1) the agent loses all of its initial money, or 2) the agent reaches the end of the data frame/time series (and has not lost all of its money). I would like to log the net change in the agent's balance at the end of each of these epochs. To do so, I maintain an array within the environment, `StockTradingEnv.list_networth`, containing the agent's net worth at each time step, and reset it (i.e. empty the array) at the start of each new epoch.

I attempted to create a subclass of `BaseCallback`, called `TensorboardCallback`, with a very simple `_on_step()` method: it checks `StockTradingEnv.done`, and if True, logs the `net_change` for that epoch (the difference between the values at the last and first indexes of `StockTradingEnv.list_networth`). However, it appears that `PPO` is only invoking its callbacks every `n_steps`, and `n_steps=1` is not permitted as per the documentation. Even with `n_steps=2`, it is possible that an epoch ends on, say, step 1001 (not divisible by 2) and thus no `net_change` will be logged for that epoch.

What is the proper solution using `stable-baselines3` to log metrics from the environment systematically at the end of each epoch, when the epoch lengths are not a constant number of steps?

Code

For the sake of brevity, I did not include the code for the custom environment here. I can always add it if someone deems it necessary.
System Info