hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] Using wrappers in EvalCallback #1014

Closed: MijnheerD closed this issue 4 years ago

MijnheerD commented 4 years ago

I am running into an issue when trying to use EvalCallback to periodically save the best model learned by a PPO2 agent. The problem is that I use a (custom) gym environment inside the EvalCallback but a wrapped environment inside the model. When running, the following error is raised:

AssertionError: the second env differs from the first env

I have tried to explain the context down below, but essentially my question is this: can we use wrappers as the env when creating callbacks? And are we supposed to? Or is this a sign that the code has another issue?

Thank you in advance!


Additional context: I am working on a project to train an agent to cancel out an incoming wave using only a few observation points. For this, we have written a custom gym environment, Advection_training, to train the agent in. I used the check_env function to check the environment and it reports no issues. We use PPO2 to train the agent and save the best model periodically using EvalCallback. We used a SubprocVecEnv inside the model and the gym environment itself inside the callback, which raised a warning but no error. This is the code we used, and it runs without error:

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.callbacks import EvalCallback
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.bench import Monitor
from Advection_Eqn.Advection_env import Advection_training as CustomEnv  # gym-like environment

Ambiente = CustomEnv(reward_func=FUNCTION)

# Vectorised training env; note that every worker closure refers to the same Ambiente instance
env = SubprocVecEnv([lambda: Monitor(Ambiente, log_dir, allow_early_resets=True) for _ in range(num_cpu)])

model = PPO2(MlpPolicy, env, policy_kwargs=policy_kwargs, verbose=1, n_cpu_tf_sess=None,
             n_steps=int(32000 / num_cpu), nminibatches=100, noptepochs=5,
             tensorboard_log="./ppo2_advection_tensorboard/" + FUNCTION + "/")

# Evaluate on the raw gym environment while training on the vectorised one
eval_callback = EvalCallback(Ambiente, best_model_save_path='./logs/' + FUNCTION + '/',
                             log_path='./logs/' + FUNCTION + '/', eval_freq=int(32000 / num_cpu),
                             deterministic=True, render=False)

time_steps = 40_000  # number of interactions with the environment
model.learn(total_timesteps=time_steps, callback=eval_callback)

However, we then wanted to try and incorporate the CuriosityWrapper created by @NeoExtended in #309, which he derived from the class BaseTFWrapper. I modified the code as follows:

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.callbacks import EvalCallback
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.bench import Monitor
from Advection_Eqn.vec_curiosity_reward import CuriosityWrapper  # written by @NeoExtended
from Advection_Eqn.Advection_env import Advection_training as CustomEnv  # gym-like environment

Ambiente = CustomEnv(reward_func=FUNCTION)

env = DummyVecEnv([lambda: Monitor(Ambiente, log_dir, allow_early_resets=True) for _ in range(num_cpu)])
env = CuriosityWrapper(env, network='mlp')  # curiosity wrapper around the vectorised training env

model = PPO2(MlpPolicy, env, policy_kwargs=policy_kwargs, verbose=1, n_cpu_tf_sess=None,
             n_steps=int(32000 / num_cpu), nminibatches=100, noptepochs=5,
             tensorboard_log="./ppo2_advection_tensorboard/" + FUNCTION + "/")

# The callback still receives the unwrapped gym environment
eval_callback = EvalCallback(Ambiente, best_model_save_path='./logs/' + FUNCTION + '/',
                             log_path='./logs/' + FUNCTION + '/', eval_freq=int(32000 / num_cpu),
                             deterministic=True, render=False)

time_steps = 40_000  # number of interactions with the environment
model.learn(total_timesteps=time_steps, callback=eval_callback)

This raises an error when the callback is triggered:

Traceback (most recent call last):
  File "E:\Users\Gebruiker\Documents\GitHub\RL_playground\venv\lib\site-packages\stable_baselines\common\vec_env\__init__.py", line 51, in sync_envs_normalization
    assert isinstance(eval_env_tmp, VecEnvWrapper), "the second env differs from the first env"
AssertionError: the second env differs from the first env

After trying different things, it seems to me that the only way of solving this is to also use a CuriosityWrapper inside the EvalCallback. However, when I do this the code runs very slowly. Without the callback, it finishes in a couple of minutes, but with the callback it was still running after an hour.

araffin commented 4 years ago

After trying different things, it seems to me that the only way of solving this is to also use a CuriosityWrapper inside the EvalCallback

Yes, you need to use the same wrappers. What you can do, however, is add a test_mode argument to that wrapper (which you set to True before passing the env to the callback) so that it does nothing during evaluation.
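For illustration, here is a minimal sketch of what such a pass-through mode could look like for a VecEnvWrapper-style curiosity wrapper. This is not the actual CuriosityWrapper from #309: the test_mode flag and the intrinsic-reward placeholder are assumptions made only to show the idea.

import numpy as np
from stable_baselines.common.vec_env import VecEnvWrapper

class CuriosityWrapperSketch(VecEnvWrapper):
    """Illustrative curiosity-style wrapper with a test_mode switch (not the real #309 code)."""

    def __init__(self, venv, network='mlp', test_mode=False):
        super(CuriosityWrapperSketch, self).__init__(venv)
        self.test_mode = test_mode  # True -> act as a plain pass-through during evaluation

    def reset(self):
        return self.venv.reset()

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        if self.test_mode:
            # Skip the curiosity machinery and keep the extrinsic rewards untouched
            return obs, rewards, dones, infos
        # Placeholder: the real wrapper derives an intrinsic bonus from a learned model
        intrinsic_bonus = np.zeros_like(rewards)
        return obs, rewards + intrinsic_bonus, dones, infos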

Also, it seems that you are using the same env instance multiple times. Do:

env = DummyVecEnv([lambda: Monitor(CustomEnv(reward_func=FUNCTION), log_dir, allow_early_resets=True) for _ in range(num_cpu)])

instead of:


Ambiente = CustomEnv(reward_func=FUNCTION)
env = DummyVecEnv([lambda: Monitor(Ambiente, log_dir, allow_early_resets=True) for _ in range(num_cpu)])
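Putting the two suggestions together, a rough sketch of the resulting setup could look like the following. The test_mode keyword is the hypothetical switch sketched above (the real CuriosityWrapper would need a small modification to accept it), and make_env is just an illustrative helper.

def make_env():
    # Build a fresh environment instance per worker instead of sharing one Ambiente
    return Monitor(CustomEnv(reward_func=FUNCTION), log_dir, allow_early_resets=True)

# Training env: curiosity bonus active
train_env = DummyVecEnv([make_env for _ in range(num_cpu)])
train_env = CuriosityWrapper(train_env, network='mlp')

# Evaluation env: same wrapper stack, but the (hypothetical) test_mode disables the bonus
eval_env = DummyVecEnv([make_env])
eval_env = CuriosityWrapper(eval_env, network='mlp', test_mode=True)

eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/' + FUNCTION + '/',
                             log_path='./logs/' + FUNCTION + '/', eval_freq=int(32000 / num_cpu),
                             deterministic=True, render=False)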
MijnheerD commented 4 years ago

Thank you for your help! It works now :) For anyone else who is stuck on this, here is what I did: