google-deepmind / bsuite

bsuite is a collection of carefully-designed experiments that investigate core capabilities of a reinforcement learning (RL) agent
Apache License 2.0

bsuite_tutorial problem when build PPO OpenAI baseline agent #18

Closed lingjunz closed 4 years ago

lingjunz commented 4 years ago

I ran into a small problem when building the PPO agent from OpenAI Baselines in the bsuite_tutorial.

```
import bsuite
from baselines.common.vec_env import dummy_vec_env
from baselines.ppo2 import ppo2
from bsuite.utils import gym_wrapper
import tensorflow as tf

SAVE_PATH_PPO = './demo_results/bsuite/ppo'


def _load_env():
  # Load the bsuite environment with CSV logging, then wrap it as a Gym env.
  raw_env = bsuite.load_and_record(
      bsuite_id='bandit_noise/0',
      save_path=SAVE_PATH_PPO, logging_mode='csv', overwrite=True)
  return gym_wrapper.GymFromDMEnv(raw_env)


# Baselines expects a vectorized environment, even for a single env.
env = dummy_vec_env.DummyVecEnv([_load_env])
```
steps,episode,total_return,episode_len,episode_return,total_regret
1,1,[49.09808016],1,[0.67640523],[51.5]
2,2,[49.09808016],1,[0.74001572],[51.5]
3,3,[49.09808016],1,[0.7978738],[51.5]
4,4,[49.09808016],1,[0.62408932],[51.5]

output input shape is (1, 1)

```
AssertionError                            Traceback (most recent call last)
 in
      1 ppo2.learn(
      2     env=env, network='mlp', lr=1e-3, gamma=.99,
----> 3     total_timesteps=10000, nsteps=100)

~/anaconda3/envs/drl/lib/python3.6/site-packages/baselines/ppo2/ppo2.py in learn(network, env, total_timesteps, eval_env, seed, nsteps, ent_coef, lr, vf_coef, max_grad_norm, gamma, lam, log_interval, nminibatches, noptepochs, cliprange, save_interval, load_path, model_fn, **network_kwargs)
    177             # or if it's just worse than predicting nothing (ev =< 0)
    178             # print( returns.shape,values.shape)
--> 179             ev = explained_variance(values, returns)
    180             logger.logkv("misc/serial_timesteps", update*nsteps)
    181             logger.logkv("misc/nupdates", update)

~/anaconda3/envs/drl/lib/python3.6/site-packages/baselines/common/math_util.py in explained_variance(ypred, y)
     34
     35     """
---> 36     assert y.ndim == 1 and ypred.ndim == 1
     37     vary = np.var(y)
     38     return np.nan if vary==0 else 1 - np.var(y-ypred)/vary

AssertionError:
```

- I found this is due to the mismatched shapes of `values` (100, 1) and `returns` (10000, 1) just before `explained_variance(values, returns)`.
- When I add one line in `baselines/ppo2/runner.py`, it seems to run correctly.

```
...
#batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=self.obs.dtype)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32)
mb_actions = np.asarray(mb_actions)
mb_values = np.asarray(mb_values, dtype=np.float32)
mb_values = mb_values.reshape(mb_rewards.shape)  # <<< add this line
mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
mb_dones = np.asarray(mb_dones, dtype=np.bool)
last_values = self.model.value(tf.constant(self.obs))._numpy()
...
```

- Final result:

```
Stepping environment...
--------------------------------------------
| eplenmean               | nan            |
| eprewmean               | nan            |
| fps                     | 271            |
| loss/approxkl           | 2.5486004e-08  |
| loss/clipfrac           | 0.0            |
| loss/policy_entropy     | 2.3978922      |
| loss/policy_loss        | -2.7894964e-09 |
| loss/value_loss         | 0.061606925    |
| misc/explained_variance | 0              |
| misc/nupdates           | 100            |
| misc/serial_timesteps   | 10000          |
| misc/time_elapsed       | 37.5           |
| misc/total_timesteps    | 10000          |
--------------------------------------------
```

- P.S. I am using TF 2.1.0, and I checked out the tf2 branch after cloning baselines.
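
The shape mismatch reported above can be reproduced with NumPy alone. The sketch below is not the baselines runner. It assumes, as one plausible explanation, that the value predictions carry an extra trailing axis, so that adding them to the rewards broadcasts silently instead of failing. The buffer names mirror the runner snippet, but the data and the shapes chosen here are hypothetical. `explained_variance` is copied from the traceback.

```python
import numpy as np


def explained_variance(ypred, y):
    # Same check and formula as shown in the traceback (baselines/common/math_util.py).
    assert y.ndim == 1 and ypred.ndim == 1
    vary = np.var(y)
    return np.nan if vary == 0 else 1 - np.var(y - ypred) / vary


nsteps, nenvs = 100, 1

# Hypothetical rollout buffers: rewards are (nsteps, nenvs), while the value
# predictions are ASSUMED to have an extra trailing axis, (nsteps, nenvs, 1).
mb_rewards = np.linspace(0.0, 1.0, nsteps * nenvs, dtype=np.float32).reshape(nsteps, nenvs)
mb_values = np.zeros((nsteps, nenvs, 1), dtype=np.float32)

# Adding the two broadcasts silently: 100 * 100 = 10000 entries instead of 100.
bad_returns = mb_rewards + mb_values
print(bad_returns.shape)  # (100, 100, 1)

# The one-line fix from the post: give the values the same shape as the rewards.
fixed_values = mb_values.reshape(mb_rewards.shape)
good_returns = mb_rewards + fixed_values
print(good_returns.shape)  # (100, 1)

# Once both arrays are flattened to 1-D, the ndim assertion passes.
ev = explained_variance(fixed_values.ravel(), good_returns.ravel())
```

Under this assumption the element counts match the ones reported in the post: 100 values against 10000 returns. Whether the tf2 branch actually produces the extra axis this way would need to be confirmed by printing the shapes inside the runner.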
aslanides commented 4 years ago

Hi there! Thanks for the detailed bug report. It seems like this is potentially an issue with the ppo baseline, which is outside the scope of bsuite.

I do notice you mention that you're using TF2, but as far as I can tell, the OpenAI baselines require TF 1.x to run -- could this be part of the issue?