hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Rank mismatch error in Custom policy for multi-discrete action space #395

Closed sheetalsh456 closed 5 years ago

sheetalsh456 commented 5 years ago

Hi,

I was trying to run ACKTR with a custom policy and an environment that uses a Box observation space and a MultiDiscrete action space.

Here is my environment init():

import numpy as np
from gym.spaces import Box, MultiDiscrete

def __init__(self):
    self.reward_range = [0, 100]
    self.action_space = MultiDiscrete([100] * 30)
    self.observation_space = Box(low=-100000, high=100000, shape=(2880, 14), dtype=np.float16)

And my custom policy is:

import tensorflow as tf
from keras import regularizers
from keras.models import Sequential
from keras.layers import Activation, Bidirectional, Conv2D, Dense, LSTM
from keras.utils import multi_gpu_model
from stable_baselines.common.policies import ActorCriticPolicy

class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch,
                                           reuse=reuse, scale=True)

        with tf.variable_scope("model", reuse=reuse):

            print("self.processed_obs", self.processed_obs.shape)  # (bs, 2880, 14)

            bs_number, seq_len, num_features = self.processed_obs.shape  # (bs, 2880, 14)
            bs = tf.shape(self.processed_obs)[0]

            # CNN acts as a feature extractor
            conv1 = multi_gpu_model(Sequential([
                Conv2D(input_shape=(1, num_features, seq_len),
                       filters=8,
                       kernel_size=(3, 3),
                       strides=(2, 1),
                       padding='same',
                       data_format="channels_first",
                       use_bias=False),
            ]), gpus=6)

            # (bs, 2880, 14) -> (bs, 1, 2880, 14) -> (bs, 1, 14, 2880) -> (bs, 8, 7, 2880)
            conv_embedding1 = conv1(tf.transpose(tf.expand_dims(self.processed_obs, axis=1), perm=[0, 1, 3, 2]))

            batch_mean, batch_var = tf.nn.moments(conv_embedding1, [0])
            scale = tf.Variable(tf.ones([seq_len]))
            beta = tf.Variable(tf.zeros([seq_len]))
            epsilon = 1e-3

            # shape remains the same - (bs, 8, 7, 2880)
            cnn_embedding1 = tf.nn.dropout(
                tf.nn.relu(tf.nn.batch_normalization(conv_embedding1, batch_mean, batch_var, beta, scale, epsilon)),
                0.2)

            conv2 = multi_gpu_model(Sequential([
                Conv2D(input_shape=(8, num_features, seq_len),
                       filters=16,
                       kernel_size=(3, 3),
                       strides=(1, 1),
                       padding='same',
                       data_format="channels_first",
                       use_bias=False),
            ]), gpus=6)

            # (bs, 8, 7, 2880) -> (bs, 16, 7, 2880)
            conv_embedding2 = conv2(cnn_embedding1)

            batch_mean, batch_var = tf.nn.moments(conv_embedding2, [0])
            scale = tf.Variable(tf.ones([seq_len]))
            beta = tf.Variable(tf.zeros([seq_len]))
            epsilon = 1e-3

            # shape remains the same - (bs, 16, 7, 2880)
            cnn_embedding2 = tf.nn.dropout(
                tf.nn.relu(tf.nn.batch_normalization(conv_embedding2, batch_mean, batch_var, beta, scale, epsilon)),
                0.2)

            new_num_features = cnn_embedding2.shape[1] * cnn_embedding2.shape[2]  # 16 * 7 = 112

            # LSTM is used for time-series prediction (3-layered)
            bilstm = multi_gpu_model(Sequential([
                Bidirectional(LSTM(units=32, return_sequences=True),
                              input_shape=(int(seq_len), int(new_num_features))),
                Bidirectional(LSTM(units=64, return_sequences=True)),
                Bidirectional(LSTM(units=64, return_sequences=True))
            ]), gpus=6)

            # Converts (bs, 16, 7, 2880) -> (bs, 16*7, 2880) -> (bs, 2880, 112) -> (bs, 2880, 128)
            feature_layer = bilstm(
                tf.transpose(
                    tf.reshape(cnn_embedding2, [bs, new_num_features, seq_len]), perm=[0, 2, 1]))

            print("feature_layer", feature_layer.shape)  # (bs, 2880, 128)

            # The non-shared components (MLP)

            pi_layers = multi_gpu_model(Sequential([
                Dense(128, input_shape=(2880, 128),
                      kernel_regularizer=regularizers.l2(0.01), activity_regularizer=regularizers.l1(0.01)),
                Activation('relu'),
                Dense(128, kernel_regularizer=regularizers.l2(0.01), activity_regularizer=regularizers.l1(0.01))
            ]), gpus=6)

            pi_latent = pi_layers(feature_layer)[:, -1, :]

            print("pi_latent", pi_latent.shape)  # (bs, 128) - 2d

            vf_layers = multi_gpu_model(Sequential([
                Dense(32, input_shape=(2880, 128),
                      kernel_regularizer=regularizers.l2(0.01), activity_regularizer=regularizers.l1(0.01)),
                Activation('relu'),
                Dense(32, kernel_regularizer=regularizers.l2(0.01), activity_regularizer=regularizers.l1(0.01))
            ]), gpus=6)

            vf_latent = vf_layers(feature_layer)[:, -1, :]

            print("vf_latent", vf_latent.shape)  # (bs, 32) - 2d

            value_fn = tf.layers.dense(vf_latent, 1, name='vf')

            print("value_fn", value_fn.shape)  # (bs, 1) - 2d

            self._proba_distribution, self._policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self._value_fn = value_fn
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        # Run one policy step: sample an action (or take the mode) and get the value estimate
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        # Return the action probability parameters for the given observations
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        # Return the value estimates for the given observations
        return self.sess.run(self.value_flat, {self.obs_ph: obs})
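
For completeness, the model is then built roughly like this (a sketch, since I've elided the actual env construction; MyEnv is a placeholder name for the env whose __init__ is shown above):

from stable_baselines import ACKTR
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: MyEnv()])         # MyEnv: placeholder for the env above
model = ACKTR(CustomPolicy, env, verbose=1)  # raises the rank mismatch error below
model.learn(total_timesteps=10000)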

The shape of pi_latent is (batch_size, 128), the shape of vf_latent is (batch_size, 32), and the shape of value_fn is (batch_size, 1), yet I get the following error:

ValueError: Rank mismatch: Rank of labels (received 2) should equal rank of logits minus 1 (received 2).

I've tried making vf_latent and pi_latent 3-D tensors, but then I get this error:

ValueError: Shape must be rank 2 but is rank 3 for 'model/pi/MatMul' (op: 'MatMul') with input shapes: [?,30,128], [30,3000].

I also tried making value_fn a (batch_size, 30) tensor, but that gives me the rank mismatch error again.

The same code above works for a discrete action space, but I'm not sure what changes to make in my custom policy for a multi-discrete action space. Can someone please help me out?

Any help/suggestions will be greatly appreciated!

Thanks!

Miffyli commented 5 years ago

The docs say ACKTR does not support multi-discrete action spaces, so this could be the result of something inside ACKTR not playing nice with such action spaces.

I assume the rank mismatch error happened in distributions.py around line 299, in the softmax_cross_entropy_with_logits_v2 call. Perhaps your actions were something the code did not expect? I would try caveman-debugging that part by printing whatever values are going into TensorFlow around those parts.
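
For what it's worth, that exact message comes from TensorFlow's labels/logits rank check and can be reproduced in isolation. A minimal sketch (the shapes are just an assumption matching your MultiDiscrete([100]*30) space; the sparse variant is what performs this particular check):

import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 3000])  # flattened logits, sum(nvec) = 3000
labels = tf.placeholder(tf.int32, [None, 30])      # one index per sub-action
# sparse softmax cross-entropy requires rank(labels) == rank(logits) - 1
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
# ValueError: Rank mismatch: Rank of labels (received 2) should equal
# rank of logits minus 1 (received 2).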

araffin commented 5 years ago

Quick question: is the bug only related to ACKTR? Did you try with PPO2, for instance? (I would expect so, looking at the title of the issue.)

sheetalsh456 commented 5 years ago

Hi,

So yes, I just checked the docs, and it seems ACKTR doesn't support it, while A2C, PPO1, and PPO2 do support multi-discrete action spaces. I tried the same code above with those three algorithms (A2C, PPO1, and PPO2), and they gave the same error.

Also, I'm not conceptually sure what the ideal shapes of pi_latent, vf_latent, and value_fn should be when I use a MultiDiscrete([100]*30) action space. I'd really like to understand the ideal shapes in this case.
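
From poking around distributions.py, it looks like both latents are expected to be rank 2, with a single linear layer producing the flattened logits. Here is a sketch of my understanding (shapes assumed; this mirrors what proba_distribution_from_latent does for a MultiDiscrete space):

import tensorflow as tf
from gym.spaces import MultiDiscrete
from stable_baselines.common.distributions import make_proba_dist_type

pdtype = make_proba_dist_type(MultiDiscrete([100] * 30))

# Both latents are rank 2: (batch_size, n_features)
pi_latent = tf.placeholder(tf.float32, [None, 128])
vf_latent = tf.placeholder(tf.float32, [None, 32])

# A single linear layer maps the latent to sum(nvec) = 3000 logits
pd, policy_logits, q_value = pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)
print(policy_logits.shape)  # (?, 3000)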

araffin commented 5 years ago

Could you give us the complete traceback?

sheetalsh456 commented 5 years ago

Okay, it turns out the same code works for A2C and PPO. It's probably just not supported by ACKTR at the moment. Thanks a lot! :)
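
In case it helps anyone else, switching algorithms was the only change needed, roughly like this (a sketch; MyEnv again stands in for the env above):

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: MyEnv()])        # placeholder name for the env above
model = PPO2(CustomPolicy, env, verbose=1)  # A2C works the same way
model.learn(total_timesteps=10000)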