balisujohn opened this issue 3 years ago
Check out the fine print regarding LSTMs (granted, this is in small text in one place, so easy to miss ^^).
For LSTM policies, you need to provide observations in batches of size n_envs, even when calling predict. See, for example, this answer.
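A minimal sketch of that pattern, following the recurrent-policy example from the stable-baselines docs (the env and hyperparameters here are just placeholders):

```python
from stable_baselines import PPO2

# For recurrent policies, nminibatches must evenly divide the number of envs.
model = PPO2('MlpLstmPolicy', 'CartPole-v1', nminibatches=1, verbose=1)
model.learn(total_timesteps=50000)

env = model.get_env()
obs = env.reset()                # shape: (n_envs, obs_dim)
state = None                     # None means "start from the initial LSTM state"
done = [False for _ in range(env.num_envs)]
for _ in range(1000):
    # The state and done mask are passed back in so the LSTM state is
    # carried over between steps (and reset when an episode ends).
    action, state = model.predict(obs, state=state, mask=done)
    obs, reward, done, _ = env.step(action)
```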
Edit: ah, my brain skipped a paragraph. Try upgrading your stable-baselines (`pip install --upgrade stable-baselines`). It should have a working version of evaluate_policy for recurrent policies. See #1017
Interesting! It looks like I was using an out-of-date version of stable-baselines. After upgrading, it still crashes, though now from an assert statement outside of stable-baselines:
```
Traceback (most recent call last):
  File "./minimal_example.py", line 15, in <module>
    (mean, std) = evaluate_policy(model,eval_env, n_eval_episodes = 10)
  File "/home/john/.local/lib/python3.6/site-packages/stable_baselines/common/evaluation.py", line 63, in evaluate_policy
    new_obs, reward, done, _info = env.step(action)
  File "/home/john/.local/lib/python3.6/site-packages/gym/wrappers/time_limit.py", line 16, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/john/.local/lib/python3.6/site-packages/gym/envs/classic_control/cartpole.py", line 92, in step
    assert self.action_space.contains(action), "%r (%s) invalid"%(action, type(action))
AssertionError: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) (<class 'numpy.ndarray'>) invalid
```
The eval env should also be a VecEnv (see e.g. this test). Admittedly this error message could be clearer with an assertion check (or automatic wrapping as VecEnv). That would be a nice PR to add :)
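Something along these lines should work (a sketch; `model` is the trained PPO2 instance from above, and n_envs = 12 is an assumption read off the 12-entry action array in the traceback, since an LSTM policy expects batches matching its training batch size):

```python
import gym
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy

# Wrap the eval env in a DummyVecEnv so evaluate_policy receives batched
# observations/actions, matching what the recurrent policy expects.
n_envs = 12
eval_env = DummyVecEnv([lambda: gym.make('CartPole-v0') for _ in range(n_envs)])
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
```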
Yep, looks like that fixed it! I'll add a PR with an assertion for that. Thanks for your quick responses; you saved me a lot of time.
(leaving this open until the pull request is ready)
Describe the bug
After training PPO2 in a vectorized environment with an MlpLstmPolicy, evaluate_policy() disallows evaluation with vectorized environments via an assert, but then crashes when evaluated with a non-vectorized environment. As far as I can tell, this means evaluate_policy is incompatible with PPO2 policies trained in vectorized environments. I think it is reasonable to consider this a bug, since stable-baselines fails with a TensorFlow crash rather than with an assert inside stable-baselines itself.
If possible, I can try fixing the crash, but it would probably be faster for someone with more understanding of the recurrent policy implementation to determine whether this is something that should be fixed, or whether it should be patched via an assert statement that disallows using any PPO2 policy trained in a vectorized environment with evaluate_policy.
Code example
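(The original snippet isn't reproduced here; the following is a hypothetical reconstruction of a `minimal_example.py` consistent with the traceback above: an LSTM model trained on a vectorized env, then evaluated on a plain gym env.)

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy

# Train with an LSTM policy on a vectorized env (n_envs = 12 is an
# assumption matching the 12-entry action array in the traceback).
n_envs = 12
train_env = DummyVecEnv([lambda: gym.make('CartPole-v0') for _ in range(n_envs)])
model = PPO2('MlpLstmPolicy', train_env, nminibatches=1)
model.learn(total_timesteps=10000)

# Evaluating with a plain (non-vectorized) env reproduces the crash:
eval_env = gym.make('CartPole-v0')
(mean, std) = evaluate_policy(model, eval_env, n_eval_episodes=10)
```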
Output
System Info
Ubuntu 18.04, Python 3.6.9
Additional context