hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Deepq train custom cartpole doesn't work #810

Closed. alex-deineha closed this issue 4 years ago.

alex-deineha commented 4 years ago

I tried to run the following code, taken from custom_cartpole.py, with stable-baselines and TensorFlow 1.14:

import itertools
import argparse

import gym
import numpy as np
import tensorflow as tf

import stable_baselines.common.tf_util as tf_utils
from stable_baselines import logger, deepq
from stable_baselines.common.buffers import ReplayBuffer
from stable_baselines.deepq.policies import FeedForwardPolicy
from stable_baselines.common.schedules import LinearSchedule

class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[64],
                                           feature_extraction="mlp")

with tf.Graph().as_default():

  with tf_utils.make_session(8) as sess:
      # Create the environment
      env = gym.make("CartPole-v0")
      # Create all the functions necessary to train the model
      act, train, update_target, _ = deepq.build_train(
          q_func=CustomPolicy,
          ob_space=env.observation_space,
          ac_space=env.action_space,
          optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),
          sess=sess
      )
      # Create the replay buffer
      replay_buffer = ReplayBuffer(50000)
      # Create the schedule for exploration starting from 1 (every action is random) down to
      # 0.02 (98% of actions are selected according to values predicted by the model).
      exploration = LinearSchedule(schedule_timesteps=10000, initial_p=1.0, final_p=0.02)

      # Initialize the parameters and copy them to the target network.
      tf_utils.initialize()
      update_target()

      episode_rewards = [0.0]
      obs = env.reset()
      for step in itertools.count():
          # Take action and update exploration to the newest value
          action = act(obs[None], update_eps=exploration.value(step))[0]
          new_obs, rew, done, _ = env.step(action)
          # Store transition in the replay buffer.
          replay_buffer.add(obs, action, rew, new_obs, float(done))
          obs = new_obs

          episode_rewards[-1] += rew
          if done:
              obs = env.reset()
              episode_rewards.append(0)

          if len(episode_rewards[-101:-1]) == 0:
              mean_100ep_reward = -np.inf
          else:
              mean_100ep_reward = round(float(np.mean(episode_rewards[-101:-1])), 1)

          is_solved = step > 100 and mean_100ep_reward >= 200

          # if args.no_render and step > args.max_timesteps:
          #     break

          if is_solved:
              if args.no_render:
                  break
              # Show off the result
              env.render()
          else:
              # Minimize the error in Bellman's equation on a batch sampled from replay buffer.
              if step > 1000:
                  obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(32)
                  train(obses_t, actions, rewards, obses_tp1, dones, np.ones_like(rewards))
              # Update target network periodically.
              if step % 1000 == 0:
                  update_target()

          if done and len(episode_rewards) % 10 == 0:
              logger.record_tabular("steps", step)
              logger.record_tabular("episodes", len(episode_rewards))
              logger.record_tabular("mean episode reward", mean_100ep_reward)
              logger.record_tabular("% time spent exploring", int(100 * exploration.value(step)))
              logger.dump_tabular()

Describe the bug
I investigated the problem a bit. Most likely the problem is that dones is being fed to the network as an observation input:

ValueError                                Traceback (most recent call last)
<ipython-input-5-97d09ee10f26> in <module>()
     77                   obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(32)
     78                   print(obses_t.shape, actions.shape, rewards.shape, obses_tp1.shape, dones.shape)
---> 79                   train(obses_t, actions, rewards, obses_tp1, dones, np.ones_like(rewards))
     80               # Update target network periodically.
     81               if step % 1000 == 0:

2 frames
/usr/local/lib/python3.6/dist-packages/stable_baselines/common/tf_util.py in __call__(self, sess, *args, **kwargs)
    328         for inpt in self.givens:
    329             feed_dict[inpt] = feed_dict.get(inpt, self.givens[inpt])
--> 330         results = sess.run(self.outputs_update, feed_dict=feed_dict, **kwargs)[:-1]
    331         return results
    332 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    948     try:
    949       result = self._run(None, fetches, feed_dict, options_ptr,
--> 950                          run_metadata_ptr)
    951       if run_metadata:
    952         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1147                              'which has shape %r' %
   1148                              (np_val.shape, subfeed_t.name,
-> 1149                               str(subfeed_t.get_shape())))
   1150           if not self.graph.is_feedable(subfeed_t):
   1151             raise ValueError('Tensor %s may not be fed.' % subfeed_t)

ValueError: Cannot feed value of shape (32,) for Tensor 'deepq/double_q/input/Ob:0', which has shape '(?, 4)'
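For reference, the mismatch can be made visible by listing the placeholders the returned train callable expects. This is a hypothetical diagnostic: it assumes the wrapper returned by tf_util.function keeps its placeholders in an inputs attribute (the traceback above shows it keeps givens and outputs_update) and that each input is a plain TF placeholder:

# Hypothetical diagnostic, assuming the function wrapper exposes `inputs`
# and each entry is a plain tf placeholder (as the error message suggests).
for position, placeholder in enumerate(train.inputs):
    print(position, placeholder.name, placeholder.get_shape())
# If the list contains one more observation placeholder than the six arrays
# passed above, every argument after the fourth shifts by one position and
# `dones` (shape (32,)) lands in an observation placeholder of shape (?, 4).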


Miffyli commented 4 years ago

Good catch. The issue is in sampling (or storing) observations in the replay buffer. After sampling you should have obs of shape (32, 4), but for some reason it ends up as a vector of 32 elements. A recent PR added an "extend" version of add, but that did not modify the original add, and at a quick glance I do not see where the error could be.

alex-deineha commented 4 years ago

Well, I investigated it a bit: obses_t and obses_tp1 have the correct shape (32, 4). The problem is in dones; the train function tries to feed dones to the neural network as an observation input. I checked that by changing the size of dones, and the exception text changed accordingly.
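For illustration, the check described above might look like this (shapes assume CartPole-v0 observations of length 4 and a batch size of 32):

obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(32)
print(obses_t.shape, obses_tp1.shape)  # (32, 4) (32, 4): the replay buffer itself is fine
print(dones.shape)                     # (32,): this is the array hitting the Ob placeholder
# Passing a dones array of a different length changes the shape reported in the
# ValueError, which confirms that dones is being fed to the observation input.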

Miffyli commented 4 years ago

Seems like the call signature has changed at some point. Looking at the DQN code that uses that tf function, obses_tp1 is provided twice (for some reason?), while this code does not do that.
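For comparison, DQN's own training step passes the next observations twice (once for the target network and once for the double-Q input), so a likely fix for the snippet above is something along these lines (a sketch, not verified against every stable-baselines version):

if step > 1000:
    obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(32)
    weights = np.ones_like(rewards)
    # Pass obses_tp1 twice, mirroring the internal DQN call: once for the
    # target network and once for the double-Q action selection.
    train(obses_t, actions, rewards, obses_tp1, obses_tp1, dones, weights)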

araffin commented 4 years ago

I think we should delete custom_cartpole; it is old code that does not follow the interface and the best practices from SB.

"(for some reason?)"

I think it is a typo; I don't see any reason.
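For context, the recommended interface mentioned here is the regular model API; a minimal sketch with the same custom policy could look like this (hyperparameters are illustrative only, not tuned):

from stable_baselines import DQN
from stable_baselines.deepq.policies import FeedForwardPolicy

class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           layers=[64],
                                           feature_extraction="mlp")

# Minimal sketch of the standard SB interface for the same setup.
model = DQN(CustomPolicy, "CartPole-v0",
            learning_rate=5e-4,
            buffer_size=50000,
            exploration_final_eps=0.02,
            verbose=1)
model.learn(total_timesteps=100000)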

araffin commented 4 years ago

I will close this issue in favor of #812, as it contains the fix.