Closed SerialIterator closed 5 years ago
Hey,
Well, this seems to be on OpenAI's side. The CartPole render function does not check whether a rendering window or an RGB image was requested.
Normally, when mode='rgb_array' is used, no rendering window is opened, as defined by the Gym doc:
def render(self, mode='human'):
    """Renders the environment.
    The set of supported modes varies per environment. (And some
    environments do not support rendering at all.) By convention,
    if mode is:
    - human: render to the current display or terminal and
      return nothing. Usually for human consumption.
    - rgb_array: Return an numpy.ndarray with shape (x, y, 3),
      representing RGB values for an x-by-y pixel image, suitable
      for turning into a video.
    - ansi: Return a string (str) or StringIO.StringIO containing a
      terminal-style text representation. The text can include newlines
      and ANSI escape sequences (e.g. for colors).
So you get one rendering window per environment because of CartPole, plus one tiled window from SubprocVecEnv.
If you want to avoid this display issue but keep the SubprocVecEnv, recreate the vectorized environment for the rendering code with only one environment:
...
model.learn(total_timesteps=25000)
# Rebuild a single-env vectorized environment just for rendering
env = DummyVecEnv([make_env(env_id, 0)])
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
It's a stopgap fix, but it is better than 5 windows.
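To illustrate the render-mode convention from the Gym docstring quoted above, here is a minimal sketch of an environment that only opens a window for mode='human'. ToyEnv and its windows_opened counter are hypothetical names made up for this illustration; this is not CartPole's actual code.

```python
class ToyEnv:
    """Hypothetical environment that follows the render-mode convention."""

    def __init__(self, width=4, height=3):
        self.width = width
        self.height = height
        self.windows_opened = 0  # counts how often a viewer window would be used

    def render(self, mode="human"):
        if mode == "human":
            # Stand-in for opening or updating an on-screen viewer.
            self.windows_opened += 1
            return None
        if mode == "rgb_array":
            # Nested lists stand in for a numpy.ndarray of RGB values.
            return [[[0, 0, 0] for _ in range(self.width)]
                    for _ in range(self.height)]
        if mode == "ansi":
            return "\n".join("." * self.width for _ in range(self.height))
        raise ValueError("unsupported render mode: %r" % mode)


env = ToyEnv()
frame = env.render(mode="rgb_array")  # returns pixels, no window is touched
env.render()                          # default 'human' mode uses the window
```

With this kind of check, a wrapper that tiles frames from several envs could ask each env for 'rgb_array' without triggering one window per env.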
That works but does it mean that after I've trained a model I have to load all envs into memory just to use one of them for testing?
Except for LSTM policies, the predict() method only needs an observation or a batch of observations (cf. documentation), so you can use as many envs as you want (e.g. only one) for testing.
For LSTM policies, you need to feed the predict method a batch with the same shape as during training, which depends on the number of envs (to test with only one env, one trick is to complete the batch of observations with zeros).
Once I get the model to converge, I'll probably need to pick your brain some more about the all zeros trick
To make it clearer: for LSTM policies, the predict method expects a shape of (n_envs, obs_space.shape), so if you want to test with only one env, construct an ndarray of shape (1, obs_space.shape) and then concatenate it with zeros to create the final ndarray.
Note: the leading dimension may differ (not sure whether it is n_envs or minibatch_size), but you get the idea.
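The zero-completion trick described above can be sketched with plain NumPy. The helper name zero_complete and the CartPole-like observation size of 4 are assumptions chosen for illustration; only the padding idea comes from the comment above.

```python
import numpy as np


def zero_complete(obs, n_training_envs):
    """Pad a batch of observations with zero rows up to n_training_envs."""
    obs = np.asarray(obs, dtype=np.float32)
    n_test_envs = obs.shape[0]
    if n_test_envs > n_training_envs:
        raise ValueError("more test envs than training envs")
    # Allocate the full batch, then copy the real observations into the first rows.
    padded = np.zeros((n_training_envs,) + obs.shape[1:], dtype=obs.dtype)
    padded[:n_test_envs] = obs
    return padded


# One observation with 4 features from a single test env,
# padded to the 8 envs assumed to have been used for training.
single_obs = np.array([[0.01, -0.02, 0.03, 0.04]])
batch = zero_complete(single_obs, n_training_envs=8)
print(batch.shape)  # (8, 4)
```

Only row 0 carries real data, so only the first action returned for such a batch is meaningful.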
Hi @araffin, I followed your comments above but am really struggling to get it to work. I am using an LSTM policy with SubprocVecEnv. My code is below:
env = DummyVecEnv([self.make_env(test_gym, 0)])
# for LSTMPolicies, the predict method expect a shape of (n_envs, obs_space.shape),
# so if you want to test with only one env,
# construct an ndarray of shape (1, obs_space.shape) and then
# concatenate it with zeros to create the final ndarray.
obs = env.reset()
zeroes = np.zeros(shape=(n_envs - 1, env.observation_space.shape[1]))
obs = np.concatenate((obs, zeroes), axis=0)
print(obs.shape)
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
With the above code, although print(obs.shape) gives me (8, 1, 77), I get the following error when attempting to predict: ValueError: Cannot feed value of shape (1, 1, 77) for Tensor 'input/Ob:0', which has shape '(8, 1, 77)'
Any ideas? Did I understand your comments correctly?
Hello,
You can find below a working example:
import gym
import numpy as np
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv
def make_env():
    def maker():
        env = gym.make("CartPole-v1")
        return env
    return maker
# Train with 2 envs
n_training_envs = 2
envs = DummyVecEnv([make_env() for _ in range(n_training_envs)])
model = PPO2("MlpLstmPolicy", envs, nminibatches=2)
# Create one env for testing
test_env = DummyVecEnv([make_env() for _ in range(1)])
test_obs = test_env.reset()
# model.predict(test_obs) would throw an error
# because the number of test envs is different from the number of training envs
# so we need to complete the observation with zeroes
zero_completed_obs = np.zeros((n_training_envs,) + envs.observation_space.shape)
zero_completed_obs[0, :] = test_obs
# IMPORTANT: with recurrent policies, don't forget the state
state = None
action, state = model.predict(zero_completed_obs, state=state)
# The test env is expecting only one action
new_obs, reward, done, info = test_env.step([action[0]])
# Update the obs
zero_completed_obs[0, :] = new_obs
Please look at the documentation on how to use recurrent policies during testing. Here, you were forgetting the state.
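The example above performs a single step. A full test loop that keeps passing the state back might look like the sketch below. StubRecurrentModel and StubVecEnv are hypothetical stand-ins written only so the loop runs without stable-baselines installed; the predict(observation, state, mask) call shape is assumed to mirror the library's, and passing the done flags as mask to reset the hidden state is likewise an assumption to check against the documentation.

```python
import numpy as np


class StubRecurrentModel:
    """Hypothetical stand-in for a recurrent policy trained on n_envs envs."""

    def __init__(self, n_envs, obs_dim, state_dim=4):
        self.n_envs, self.obs_dim, self.state_dim = n_envs, obs_dim, state_dim

    def predict(self, observation, state=None, mask=None):
        # The batch size must match the one used for training.
        if observation.shape != (self.n_envs, self.obs_dim):
            raise ValueError("unexpected shape %s" % (observation.shape,))
        if state is None:
            state = np.zeros((self.n_envs, self.state_dim))
        if mask is not None:
            # Zero the hidden state of envs whose episode just ended.
            state = state * (1.0 - np.asarray(mask, dtype=float))[:, None]
        state = state + 1.0  # dummy recurrent update
        actions = (observation.sum(axis=1) > 0).astype(int)
        return actions, state


class StubVecEnv:
    """Hypothetical single-env vectorized environment with 5-step episodes."""

    def __init__(self, obs_dim, episode_length=5):
        self.obs_dim, self.episode_length, self.t = obs_dim, episode_length, 0

    def reset(self):
        self.t = 0
        return np.ones((1, self.obs_dim))

    def step(self, actions):
        assert len(actions) == 1, "the test env expects exactly one action"
        self.t += 1
        done = self.t >= self.episode_length
        if done:
            self.t = 0  # vectorized envs reset automatically
        return np.ones((1, self.obs_dim)), np.array([1.0]), np.array([done]), [{}]


n_training_envs, obs_dim = 2, 4
model = StubRecurrentModel(n_training_envs, obs_dim)
test_env = StubVecEnv(obs_dim)

zero_completed_obs = np.zeros((n_training_envs, obs_dim))
zero_completed_obs[0, :] = test_env.reset()[0]
state = None
mask = np.zeros(n_training_envs, dtype=bool)
episodes = 0
for _ in range(12):
    action, state = model.predict(zero_completed_obs, state=state, mask=mask)
    # Only the first action belongs to the real test env.
    new_obs, reward, done, info = test_env.step([action[0]])
    zero_completed_obs[0, :] = new_obs[0]
    mask[0] = done[0]  # ask the policy to reset env 0's hidden state next call
    episodes += int(done[0])
```

The key point is that state is reassigned on every call and fed back in; starting from None on each step would discard the recurrent memory.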
This is the code used for prediction:
n_cpu = 1
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'])
env = SubprocVecEnv([lambda: env for _ in range(n_cpu)])
mdl = 'futures_20100101_20180101_5000000_2000_3_return_False'
model = PPO2.load(mdl)
# initialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_cpu,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs
state = None
# state = model.initial_state # get the initial state vector for the recurrent network
# done = np.zeros(state.shape[0]) # set all environments to not done
weights, state = model.predict(zero_completed_obs, state)
# print(weights)
return weights, settings
I get this error in model.predict:
<class 'ValueError'>
Traceback (most recent call last):
File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
position, settings = TSobject.myTradingSystem(*argList)
File "ppo2_quantiacs_test.py", line 47, in myTradingSystem
weights, state = model.predict(zero_completed_obs, state)
File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 472, in predict
actions, _, states, _ = self.step(observation, state, mask, deterministic=deterministic)
File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\policies.py", line 508, in step
{self.obs_ph: obs, self.states_ph: state, self.dones_ph: mask})
File "C:\Users\hanna\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Users\hanna\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 675) for Tensor 'input/Ob:0', which has shape '(12, 675)'
Please read my example carefully: you have to use n_training_envs, not n_cpu.
What's the difference between n_training_envs and n_cpu? It is just the name of a variable.
You trained your agent with 12 envs (according to the error) and want to test it with only one.
But here, n_cpu != n_training_envs, so you get an error.
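The difference is in the value, not the name. A small NumPy sketch using the numbers from the error message (12 training envs, observation size 675) shows why sizing the buffer with the number of test envs cannot work; the variable names are chosen for illustration only.

```python
import numpy as np

n_training_envs = 12  # batch size the saved model expects, per the error message
n_test_envs = 1       # what the code above calls n_cpu at test time
obs_dim = 675         # observation size, also taken from the error message

test_obs = np.ones((n_test_envs, obs_dim))  # stand-in for env.reset()

# Sized with the number of test envs: still (1, 675), so the feed would fail.
wrong_batch = np.zeros((n_test_envs, obs_dim))
wrong_batch[0, :] = test_obs[0]

# Sized with the number of training envs: (12, 675), matching the placeholder.
right_batch = np.zeros((n_training_envs, obs_dim))
right_batch[0, :] = test_obs[0]

expected_shape = (n_training_envs, obs_dim)
print(wrong_batch.shape == expected_shape)  # False
print(right_batch.shape == expected_shape)  # True
```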
I changed it according to your example:
n_env = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'])
env = DummyVecEnv([lambda: env for _ in range(1)])
mdl = 'futures_20100101_20180101_5000000_2000_3_return_False'
# mdl = 'futures_20100101_20180101_5000000_2000_3_return_False_c7616a5f58b141aa989379427458bbe8'
model = PPO2.load(mdl)
# initialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_env,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs
state = None
# state = model.initial_state # get the initial state vector for the recurrent network
# done = np.zeros(state.shape[0]) # set all environments to not done
pos, state = model.predict(zero_completed_obs, state)
Still get:
ValueError: could not broadcast input array from shape (12,45) into shape (45)
I guess that I have to take the first row of the pos matrix? pos[0] ?
OK, you did not show all the code. Sure, your test env is expecting only one action. Please try things by yourself before asking a question at each step.
EDIT: I updated the example accordingly
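The broadcast error reported above can be reproduced with NumPy alone. In this sketch, pos is a random stand-in for the batched output of model.predict (12 envs, 45 values each, as in the error message) and weights is a hypothetical buffer for a single env's action.

```python
import numpy as np

n_env, n_assets = 12, 45
pos = np.random.rand(n_env, n_assets)  # stand-in for the batched prediction
weights = np.zeros(n_assets)           # buffer for one env's action

message = None
try:
    weights[:] = pos                   # 12 rows do not fit into a single row
except ValueError as err:
    message = str(err)

weights[:] = pos[0]                    # only row 0 belongs to the real test env
```

Rows 1 to 11 correspond to the zero-filled observations and can be ignored.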
Still have a problem in
pos, state = model.predict(zero_completed_obs, state, done)
ValueError: Error: Unexpected observation shape (12, 5) for Box environment, please use (10,) or (n_env, 10) for the observation shape.
Model was trained with n_env = 12
Where does this 10 come from?
A few things:
Your issue will not be addressed if you do not follow the format described in the issue template (https://github.com/hill-a/stable-baselines/blob/master/.github/ISSUE_TEMPLATE/issue-template.md)
n_env = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'], include_ta=settings['include_ta'],allow_short=settings['allow_short'], reward=settings['reward'], debug=settings['debug'])
env = SubprocVecEnv([lambda: env for _ in range(1)])
mdl = 'ES_19900102_20180101_5000000_7000_1_return_False_7a686c53e4a34338942a8b4bbe65fa47'
model = PPO2.load(mdl)
# initialized here
obs = env.reset()
zero_completed_obs = np.zeros((n_env,) + env.observation_space.shape)
zero_completed_obs[0, :] = obs
state = None
state = model.initial_state
done = np.zeros(state.shape[0])
pos, state = model.predict(zero_completed_obs, state, done)
Traceback (most recent call last):
File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
position, settings = TSobject.myTradingSystem(*argList)
File "ppo2_quantiacs_test.py", line 68, in myTradingSystem
pos, state = model.predict(zero_completed_obs, state, done)
File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 469, in predict
vectorized_env = self._is_vectorized_observation(observation, self.observation_space)
File "C:\Users\hanna\Anaconda3\lib\site-packages\stable_baselines\common\base_class.py", line 399, in _is_vectorized_observation
.format(", ".join(map(str, observation_space.shape))))
ValueError: Error: Unexpected observation shape (12, 5) for Box environment, please use (10,) or (n_env, 10) for the observation shape.
Reading code in pure text is not pleasant, and formatting it only takes a few seconds.
Also, you are not using the latest version of stable-baselines, you must :
as you will see that it says to describe which version of stable-baselines you have.
You are loading a model that expects an observation of shape (n_env, 10). The error message is explicit.
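To make the error message concrete, here is a simplified sketch of the kind of check that produces it. The function name classify_observation is made up, and this is not the library's actual implementation, only an illustration of comparing an observation's shape against the observation space's shape.

```python
def classify_observation(obs_shape, space_shape):
    """Return True for a batch, False for a single observation, else raise."""
    obs_shape, space_shape = tuple(obs_shape), tuple(space_shape)
    if obs_shape == space_shape:
        return False  # a single, non-vectorized observation
    if len(obs_shape) == len(space_shape) + 1 and obs_shape[1:] == space_shape:
        return True   # a batch of shape (n_env,) + space_shape
    raise ValueError(
        "Unexpected observation shape %s, expected %s or (n_env,) + %s"
        % (obs_shape, space_shape, space_shape)
    )


# The loaded model was trained on observations of shape (10,).
print(classify_observation((12, 10), (10,)))  # True
print(classify_observation((10,), (10,)))     # False
```

A batch of shape (12, 5) fails both comparisons, which means the test environment produces 5 features per observation while the saved model expects 10.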
OK, it was my mistake; I realize it now. I am now getting NaN in prediction. Anyway, I emailed you and Antonin privately. Even if I do not get NaN, it is not working on new unseen data, and in fact it does not even work when testing on the same data it was trained on. I hope that you can help and finish this once and for all.
> I am now getting NaN in prediction.
You might want to have a look at this: https://stable-baselines.readthedocs.io/en/master/guide/checking_nan.html It will help you find the NaNs in your code, specifically the VecCheckNan wrapper: https://stable-baselines.readthedocs.io/en/master/guide/checking_nan.html#vecchecknan-wrapper
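Independently of that wrapper, a quick manual check on observations, rewards, or predicted actions can narrow down where the first NaN appears. The helper below is a hypothetical sketch using plain NumPy; it is not the VecCheckNan wrapper.

```python
import numpy as np


def find_invalid(name, array):
    """Return a short report if the array holds NaN or inf, else None."""
    array = np.asarray(array, dtype=float)
    bad = ~np.isfinite(array)
    if not bad.any():
        return None
    # Convert to plain ints so the report reads the same on any NumPy version.
    first = tuple(int(i) for i in np.argwhere(bad)[0])
    return "%s has %d non-finite value(s), first at index %s" % (
        name, int(bad.sum()), first)


clean = np.array([[0.1, 0.2], [0.3, 0.4]])
broken = np.array([[0.1, np.nan], [np.inf, 0.4]])
print(find_invalid("obs", clean))
print(find_invalid("obs", broken))
```

Calling such a check on the environment's outputs before they reach the model helps tell apart NaNs that come from the data and NaNs produced during training.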
> Even if I do not get NaN, it is not working on new unseen data, and in fact it does not even work when testing on the same data it was trained on.
Reinforcement learning is not a magic bullet; it is in no way guaranteed to work all the time on every problem. For a mathematical reference, see the no free lunch theorem, which states:
Any two optimization algorithms are equivalent when their performance is
averaged across all possible problems
including random optimization algorithms.
You might want to try some tricks like VecFrameStack, VecNormalize, or a hyperparameter search to help the algorithm optimize the way you would like.
> I hope that you can help and finish this once and for all.
If you believe you have found a bug in the code of stable-baselines and can show it reliably: we will address it.
If you need tech support or consulting: we will not help.
We do not have the time, nor the obligation for consulting on stable-baselines. The library is "as is", as described in the MIT licence: https://github.com/hill-a/stable-baselines/blob/master/LICENSE.
I understand that you do not have any obligation to counsel. I am trying to implement this: http://www-scf.usc.edu/~zhan527/post/cs599/ with stable-baselines. In the original article it does work, even on unseen data. He created his own DDPG agent, and I understand that PPO is supposed to be better.
> In the original article it does work, even on unseen data.
Correction: on the given unseen data. It is possible to generate data that will not give a positive result for the algorithm. That is the whole point of adversarial learning.
> He created his own DDPG agent, and I understand that PPO is supposed to be better.
How did you get that impression? Both have advantages and disadvantages.
EDIT: if you are trying to replicate the results of the blog post, why don't you use their hyperparameters with DDPG?
If that fails, then try to find the underlying implementation differences between the blog post's DDPG and stable-baselines's DDPG.
In fact, why use stable-baselines at all? They have a GitHub repo of their solution: https://github.com/vermouth1992/drl-portfolio-management
I know that they have a GitHub repository with their code. There are other similar works on GitHub, for example https://github.com/yuriak/RLQuant or https://github.com/liangzp/Reinforcement-learning-in-portfolio-management- I was hoping that stable-baselines would let me test various agents and not be confined to DDPG only. In addition, stable-baselines has TensorBoard integration. In any case, at this point I still believe that the problem is with my code and not with the agent or hyperparameters. The original work is actually this: https://arxiv.org/abs/1808.09940
Locking issue, diverging too much from the original message.
When I run the code example from the docs for CartPole multiprocessing, it renders one window with all envs playing the game. It also renders individual windows with the same envs playing the same games.