hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] Issue with results_plotter #908

Closed kosmylo closed 4 years ago

kosmylo commented 4 years ago

I use results_plotter to plot the episode reward at the end of training. Recently, for no apparent reason, the plot has stopped showing the mean reward from the beginning of training; it only appears after a certain timestep, as shown below:

Figure_1

I use the following script for training:


import os
import read_params
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

from environment import ChargingStation

from stable_baselines.td3.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines.common.noise import OrnsteinUhlenbeckActionNoise
from stable_baselines import TD3
from stable_baselines.bench import Monitor
from stable_baselines import results_plotter

params, profiles = read_params.Charging_Station_Params()

# Create unique log dir
log_dir = "/tmp/td3/"
os.makedirs(log_dir, exist_ok = True)

env = ChargingStation()
env = Monitor(env, log_dir, allow_early_resets = True)
env = DummyVecEnv([lambda: env])

# Automatically normalize the input features and rewards and stack the previous observations
env = VecNormalize(env, norm_obs = True, norm_reward = True, clip_obs = 10.)

# the noise objects for TD3
n_actions = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))

# Custom MLP policy 
policy_kwargs = dict(act_fun = tf.nn.relu, layers = [256, 256, 256])
buffer_size = 100000
gamma = 0.999

model = TD3(MlpPolicy, env, gamma = gamma, policy_kwargs = policy_kwargs, buffer_size = buffer_size, verbose = 1, action_noise = action_noise, tensorboard_log= log_dir + "/td3_ev_charging_tensorboard/")

model.learn(total_timesteps = params.time_steps)

# Don't forget to save the VecNormalize statistics when saving the agent
model.save(log_dir + "td3_ev_charging")
env.save(os.path.join(log_dir, "vec_normalize.pkl"))

# Plot learning curve
results_plotter.plot_results([log_dir], params.time_steps, results_plotter.X_TIMESTEPS, "TD3 ChargingStation")
plt.show()

Any idea what is happening?

araffin commented 4 years ago

Any idea what is happening?

This is called a moving mean: a sliding window is used to compute the mean, which reduces the noise in the episodic reward curve. https://stable-baselines.readthedocs.io/en/master/misc/results_plotter.html#stable_baselines.results_plotter.rolling_window
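
To make the behaviour concrete, here is a minimal sketch (not the library code itself) of a stride-trick rolling mean in the spirit of the rolling_window helper linked above; the array contents and the window size of 100 are purely illustrative:

import numpy as np

def rolling_mean(y, window):
    # View every consecutive length-`window` slice of y as a row, then average the rows.
    # The result has len(y) - window + 1 points: the smoothed curve only exists once
    # `window` values are available, which is why it starts late in the plot.
    shape = y.shape[:-1] + (y.shape[-1] - window + 1, window)
    strides = y.strides + (y.strides[-1],)
    windows = np.lib.stride_tricks.as_strided(y, shape=shape, strides=strides)
    return windows.mean(axis=-1)

episode_rewards = np.random.randn(500)   # stand-in for the rewards read from the Monitor files
timesteps = np.arange(1, 501)            # stand-in for the cumulative timestep of each episode
window = 100                             # illustrative window size
y_smoothed = rolling_mean(episode_rewards, window)
x_smoothed = timesteps[window - 1:]      # the x axis is shortened by window - 1 points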

Anyway, as mentioned in the doc, I would recommend using the rl zoo and an EvalCallback to monitor the true performance rather than the training performance.
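
For reference, a minimal sketch of the EvalCallback approach, assuming a stable-baselines version that ships stable_baselines.common.callbacks; the evaluation frequency, number of episodes and file paths below are only illustrative:

from stable_baselines.common.callbacks import EvalCallback

# A separate, monitored copy of the environment, used only for evaluation.
eval_env = DummyVecEnv([lambda: Monitor(ChargingStation(), os.path.join(log_dir, "eval"), allow_early_resets=True)])

# Caveat: since training uses VecNormalize, the evaluation env would need matching
# normalization statistics; that synchronisation is omitted in this sketch.
eval_callback = EvalCallback(eval_env,
                             best_model_save_path=log_dir,
                             log_path=log_dir,
                             eval_freq=10000,
                             n_eval_episodes=5,
                             deterministic=True)

model.learn(total_timesteps=params.time_steps, callback=eval_callback)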

kosmylo commented 4 years ago

I understand that this is the moving mean, but why does it not compute the mean for the first few timesteps? If you look at the image, the mean (blue line) is not plotted from the beginning.

araffin commented 4 years ago

why does it not compute the mean for the first few timesteps?

How do you compute the mean of 2 elements using a moving window of size 10? You could say that this would be the mean of those 2 elements, but that is not really satisfying (and also, the implementation uses a 1D conv, https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/results_plotter.py#L30, and thus does not work like that). The mean is only defined once the number of timesteps reaches the window size, so the curve does not start at t=0.
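
As a quick illustration (again not the library code, just the same "valid"-window idea): with a window of 10, the smoothed series only produces its first point once 10 samples exist, so it is shorter than the raw series by window - 1 points.

import numpy as np

y = np.arange(25, dtype=float)                               # 25 episode rewards
window = 10
smoothed = np.convolve(y, np.ones(window) / window, mode="valid")

print(len(y), len(smoothed))                                 # 25 16 -> the first 9 points have no mean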