hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

GAIL - SAC: Cannot feed value of shape (1024,) for Tensor 'expert_actions_ph:0', which has shape '(?, 2)' #310

Closed. ghost closed this issue 5 years ago.

ghost commented 5 years ago

Describe the bug
Hi, I'm working on GAIL, providing my own trajectory just as in the CartPole and Pendulum examples. It works with DQN and PPO, but when I try to use it with DDPG or SAC I get the TensorFlow error "Cannot feed value of shape (1024,) for Tensor 'expert_actions_ph:0', which has shape '(?, 2)'". As you can see from the shape, I do provide a non-discrete, Box action space. SAC and DDPG work on their own with the same environment; the error only appears when I plug them into GAIL as in the example shown on its documentation page. Is there something I am missing, or is this a bug?

Code example

train_env = DummyVecEnv([lambda: Environment(mode="train", pair=PAIR, interval=INTERVAL,  algo=ALGO, data_features=FEATURES)])
model = SAC('MlpPolicy', train_env, verbose=1)
dataset = ExpertDataset(expert_path='/Users/apple/Desktop/dev/tools/expert_pr.npz', traj_limitation=1, verbose=1)
model = GAIL("MlpPolicy", train_env, dataset, verbose=1)
-> self.action_space = spaces.Box(low=np.array([-len(self.actions), -1]), high=np.array([len(self.actions), 1]), dtype=np.float32)
(Pdb) np.array([len(self.actions), 1])
array([3, 1])
(Pdb) np.array([-len(self.actions), -1])
array([-3, -1])
(Pdb) self.observation_space = spaces.Box(low=-self.nfeatures, high=self.nfeatures, shape=self.shape, dtype=np.float32)
(Pdb) self.nfeatures
3
(Pdb) self.shape
(3,)
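
For reference, a quick way to see the mismatch is to compare the shapes stored in the npz file against the environment's spaces. This is an illustrative check, not part of the original report; the npz keys match what ExpertDataset prints in the loading log below, and train_env.envs[0] unwraps the DummyVecEnv from the code above.

import numpy as np

# Illustrative diagnosis: compare the recorded dataset against the env spaces.
data = np.load('/Users/apple/Desktop/dev/tools/expert_pr.npz')
env = train_env.envs[0]  # unwrap the DummyVecEnv

print(data['obs'].shape, env.observation_space.shape)  # (7562, 3) vs (3,)  -> consistent
print(data['actions'].shape, env.action_space.shape)   # (7562,)   vs (2,)  -> mismatch
# GAIL builds 'expert_actions_ph' with shape (None, 2) from the Box action
# space, so a 1-D actions array cannot be fed into it.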


Additional context

Colocations handled automatically by placer.
obs (7562, 3)
actions (7562,)
rewards (7562,)
episode_starts (7562,)
episode_returns (1,)
Total trajectories: 1
Total transitions: 7562
Average returns: 153.6864008635515
Std for returns: 0.0
WARNING:tensorflow:From /Users/apple/miniconda3/envs/projectlife/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
********** Iteration 0 ************
Optimizing Policy...
sampling
done in 3.941 seconds
computegrad
done in 0.659 seconds
conjugate_gradient
      iter residual norm  soln norm
         0    0.00505          0
         1   0.000359     0.0415
         2    9.3e-09     0.0484
         3   7.82e-11     0.0494
done in 0.886 seconds
Expected: 0.008 Actual: 0.008
Stepsize OK!
vf
done in 0.303 seconds
sampling
done in 2.805 seconds
computegrad
done in 0.014 seconds
conjugate_gradient
      iter residual norm  soln norm
         0    0.00763          0
         1   1.72e-05     0.0437
         2   1.72e-07      0.044
         3   7.88e-07     0.0441
         4   7.21e-11     0.0583
done in 0.046 seconds
Expected: 0.009 Actual: 0.009
Stepsize OK!
vf
done in 0.123 seconds
sampling
done in 2.725 seconds
computegrad
done in 0.027 seconds
conjugate_gradient
      iter residual norm  soln norm
         0    0.00838          0
         1   0.000262     0.0494
         2   1.11e-05     0.0535
         3   4.62e-06      0.054
         4   3.81e-12       0.19
done in 0.021 seconds
Expected: 0.010 Actual: 0.010
Stepsize OK!
vf
done in 0.113 seconds
Optimizing Discriminator...
generator_loss |   expert_loss |       entropy |  entropy_loss | generator_acc |    expert_acc
Traceback (most recent call last):
  File "train_sb.py", line 20, in <module>
    model.learn(total_timesteps=TRAIN_TIMESTEPS)
  File "/Users/apple/miniconda3/envs/projectlife/lib/python3.6/site-packages/stable_baselines/gail/model.py", line 78, in learn
    self.trpo.learn(total_timesteps, callback, seed, log_interval, tb_log_name, reset_num_timesteps)
  File "/Users/apple/miniconda3/envs/projectlife/lib/python3.6/site-packages/stable_baselines/trpo_mpi/trpo_mpi.py", line 441, in learn
    *newlosses, grad = self.reward_giver.lossandgrad(ob_batch, ac_batch, ob_expert, ac_expert)
  File "/Users/apple/miniconda3/envs/projectlife/lib/python3.6/site-packages/stable_baselines/common/tf_util.py", line 300, in __call__
    results = sess.run(self.outputs_update, feed_dict=feed_dict, **kwargs)[:-1]
  File "/Users/apple/miniconda3/envs/projectlife/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/Users/apple/miniconda3/envs/projectlife/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1024,) for Tensor 'expert_actions_ph:0', which has shape '(?, 2)'
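
For what it's worth, the failing feed can be reproduced in isolation with the TF1 API that stable-baselines is built on. This is a minimal sketch of the feed mechanics, not code from the library:

import numpy as np
import tensorflow as tf

# The discriminator declares a 2-D placeholder for a Box(2,) action space:
ph = tf.placeholder(tf.float32, shape=(None, 2), name='expert_actions_ph')
out = tf.reduce_sum(ph)

with tf.Session() as sess:
    # Feeding a 1-D batch of 1024 scalar actions raises the same ValueError:
    sess.run(out, feed_dict={ph: np.zeros(1024, dtype=np.float32)})
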
araffin commented 5 years ago

Hello, it seems your action space has an extra dimension, shouldn't it be (3,) instead of (3, 1)?
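
(For reference, gym infers the Box shape from the bound arrays, which is where the 2 in '(?, 2)' comes from; a minimal check, using the bounds from the pdb session above:)

import numpy as np
from gym import spaces

# Two entries in low/high -> gym infers shape (2,).
space = spaces.Box(low=np.array([-3, -1]), high=np.array([3, 1]), dtype=np.float32)
print(space.shape)  # (2,) -> the placeholder becomes (None, 2)

So the action space itself is consistent with the placeholder; it is the recorded actions array, with 1-D shape (7562,), that has the wrong dimensionality.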

ghost commented 5 years ago

I just solved it. I made some changes to record_expert.py to correctly generate trajectories for my environment, and SAC works now. Thank you for your work on this great repo!
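
(A sketch of one way the recording can be fixed, assuming it goes through generate_expert_traj as in the pretraining docs; the exact changes to record_expert.py were not posted. With the continuous env, the recorded actions come out with shape (n, 2) automatically:)

from stable_baselines import SAC
from stable_baselines.gail import generate_expert_traj

# Train an expert on the same Box(2,) environment, then roll it out;
# each recorded action is a length-2 vector, so 'actions' ends up (n, 2).
expert = SAC('MlpPolicy', train_env, verbose=1)
generate_expert_traj(expert, 'expert_pr', n_timesteps=100000, n_episodes=10)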