keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

LSTM many to many mapping problem #2403

Closed mina-nik closed 8 years ago

mina-nik commented 8 years ago

I want to implement the fourth architecture from the left (the first many-to-many): [diagram of RNN sequence architectures]

In my case the lengths of the input and output aren't equal (I mean the numbers of blue and red units aren't equal)! I have n samples to train the NN, so the input shape is n x n_prev x 1 and the output shape is n x n_nxt x 1. My model is as follows:

# imports assumed from context (Keras 0.3.x API)
from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.layers.recurrent import LSTM

batch_size = 50
n_prev = 100
n_nxt = 5
epochs = 12

print("creating model...")
unit_number = n_nxt
model = Sequential()
model.add(LSTM(unit_number,
               batch_input_shape=(batch_size, n_prev, 1),
               forget_bias_init='one',
               return_sequences=True,
               stateful=True))
model.add(Dropout(0.2))
model.add(LSTM(unit_number,
               batch_input_shape=(batch_size, n_prev, 1),
               forget_bias_init='one',
               return_sequences=False,
               stateful=True))
model.add(Dropout(0.2))
model.add(Dense(n_nxt))
model.compile(loss='mse', optimizer='rmsprop')

print('Training')
numIteration = len(X_train) / batch_size  # integer division under Python 2
for i in range(epochs):
    print('Epoch', i, '/', epochs)
    for j in range(numIteration):
        print('Batch', j, '/', numIteration, 'Epoch', i)
        model.train_on_batch(X_train[j*batch_size:(j+1)*batch_size],
                             y_train[j*batch_size:(j+1)*batch_size])
    model.reset_states()

print('Predicting')
predicted_output = model.predict(X_test, batch_size=batch_size)

But I think I have actually implemented the third one (because return_sequences=False in the second LSTM layer, according to this article: https://github.com/Vict0rSch/deep_learning/tree/master/keras/recurrent). I think I get a single output from the second LSTM layer, like the third architecture (from the left), which is then extended to five outputs by the Dense layer. When I set return_sequences=True in the second LSTM layer I get the following error:

Traceback (most recent call last):
  File "/home/mina/Documents/research/one cores (LSTM) seq2seq  one core vector2vector  train on batch statefull 15/KerasLSTM.py", line 259, in <module>
    main()
  File "/home/mina/Documents/research/one cores (LSTM) seq2seq  one core vector2vector  train on batch statefull 15/KerasLSTM.py", line 200, in main
    model.add(Dense(n_nxt))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/containers.py", line 68, in add
    self.layers[-1].set_previous(self.layers[-2])
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 85, in set_previous
    str(layer.output_shape))
AssertionError: Incompatible shapes: layer expected input with ndim=2 but previous layer has output_shape (1, 100, 5)

Can I omit the Dense layer? As I understood, if I don't set return_sequences=True I will have the third architecture, is that right? Is it a problem that the length of the input differs from the length of the output?

carlthome commented 8 years ago

You're on the right track. Set return_sequences=True, then make sure to wrap Dense with a TimeDistributed "wrapper" layer. It's Keras's way of applying a layer to several time steps.

Different lengths of input and output are not a problem as long as all your input samples are of the same length and all your output samples are of the same length. If not, you should pad the sequences with a dummy value so they're all of equal length (you can still treat inputs and targets separately) and add Masking layers.
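
A rough sketch of what that wrapping looks like (my own illustration, not from the thread; it assumes Keras >= 1.0 and the layer sizes are just placeholders):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

timesteps, input_dim, output_dim = 100, 1, 1

model = Sequential()
# return_sequences=True keeps one output vector per input timestep
model.add(LSTM(32, input_shape=(timesteps, input_dim), return_sequences=True))
# TimeDistributed applies the same Dense layer to every timestep independently
model.add(TimeDistributed(Dense(output_dim)))
model.compile(loss='mse', optimizer='rmsprop')
# output shape: (batch_size, timesteps, output_dim)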

mina-nik commented 8 years ago

@carlthome : Thanks a lotttttt. My Keras doesn't have a wrappers sub-module to import TimeDistributed from! How is that possible? What about TimeDistributedDense? If I set return_sequences=True and use TimeDistributedDense instead, will it work the right way or not?

fluency03 commented 8 years ago

@mininaNik TimeDistributedDense has been deprecated since Keras 0.3.3, see here. In Keras >= 1.0 you will find the TimeDistributed() wrapper, see here; you use it by putting Dense() inside the wrapper.

mina-nik commented 8 years ago

@fluency03 Thank you. The point is that my Keras version is 0.3.1, and it doesn't have the wrappers module, so I can't import it. I might need to upgrade Keras!

fluency03 commented 8 years ago

@mininaNik I think this will do: pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

mina-nik commented 8 years ago

Thank you @fluency03. I upgraded it and it works now.

This is my final code. I want to predict n_nxt steps ahead by using the n_prev previous values. I have n samples to train the NN, so the input shape is n x n_prev x 1 and the output shape is n x n_nxt x 1:

# imports assumed from context (Keras >= 1.0 API)
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed

n_nxt = 5
batch_size = 50
epochs = 12
n_prev = 100

print("creating model...")
unit_number = 12
model = Sequential()
model.add(LSTM(unit_number,
               batch_input_shape=(batch_size, n_prev, 1),
               forget_bias_init='one',
               return_sequences=True,
               stateful=True))
model.add(Dropout(0.2))
model.add(LSTM(unit_number,
               batch_input_shape=(batch_size, n_prev, 1),
               forget_bias_init='one',
               return_sequences=True,
               stateful=True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_nxt)))
model.compile(loss='mse', optimizer='rmsprop')

print('Training')
numIteration = len(X_train) / batch_size  # integer division under Python 2
for i in range(epochs):
    print('Epoch', i, '/', epochs)
    for j in range(numIteration):
        print('Batch', j, '/', numIteration, 'Epoch', i)
        model.train_on_batch(X_train[j*batch_size:(j+1)*batch_size],
                             y_train[j*batch_size:(j+1)*batch_size])
    model.reset_states()

print('Predicting')
predicted_output = model.predict(X_test, batch_size=batch_size)

My model is as follows: [model summary image]

Since my batch_size is 50, the input shape for each training step is (50, 100, 1). According to the model, the output shape at the end is (50, 100, 5), but I want this shape: (50, 5, 1). Could you please help me solve this problem? (I want to predict 5 steps ahead, but the length is 100 at the end instead of 5.) Here I have 12 input nodes and all of them are fed with 100 inputs; I think the right way is to have 100 input nodes, each fed by only one value. Is that right?

fluency03 commented 8 years ago

I guess what you want is:

input shape is (50, 100, 12), output shape is (50, 5, 12).

But I am not sure whether Keras supports such a many-to-many case with different input and output lengths. You can try it; I think it should work.

mina-nik commented 8 years ago

shape=(nb_sample, time_dimension, input_dim)

My input is a numpy matrix of shape n x 100 x 1 and the output is n x 5 x 1 (I want to predict 5 steps ahead by using the 100 previous values), so my input_dim and output_dim are 1. When I feed the NN, nb_sample is set to batch_size automatically (nb_sample = batch_size). Since I set the number of units in the middle layers to 12 (I chose that for no particular reason and it can be changed), the last entry of the output shape has changed to 12.

The shape that I used for training the network is (batch_size, 100, 1) = (50, 100, 1), and I expect to predict five steps ahead and get (50, 5, 1) as the output shape. In short, I want an NN that converts 100 inputs into 5 outputs.

fluency03 commented 8 years ago

Then I think model.add(TimeDistributed(Dense(n_nxt))) should actually be model.add(TimeDistributed(Dense(1))), since each of your single inputs has dimension (1, 1, 1) rather than (1, 1, input_length). If it were (1, 1, input_length), then it should be model.add(TimeDistributed(Dense(input_length))).

For example, if this is your data:

X_train    | y_train
ABCDEFGHIG | KLM
...

Each of the characters can be represented by a one-hot vector, so each of them is a vector of length input_length = 26, since there are 26 letters. This input length depends on your embedding method.

Then the input dimension here is (nb_sample, 10, 26), the output dimension is (nb_sample, 3, 26), and the dense layer is TimeDistributed(Dense(26)).
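
A toy illustration of those shapes (not from the thread; plain NumPy, with the alphabet and the one_hot helper made up for the example):

import numpy as np

alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
char_to_index = {c: i for i, c in enumerate(alphabet)}

def one_hot(seq):
    # encode a string as a (timesteps, 26) one-hot matrix
    out = np.zeros((len(seq), len(alphabet)))
    for t, c in enumerate(seq):
        out[t, char_to_index[c]] = 1.0
    return out

X_train = np.array([one_hot('ABCDEFGHIG')])  # shape (nb_sample, 10, 26)
y_train = np.array([one_hot('KLM')])         # shape (nb_sample, 3, 26)
print(X_train.shape, y_train.shape)          # (1, 10, 26) (1, 3, 26)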

mina-nik commented 8 years ago

shape=(nb_sample, time_dimension, input_dim)

time_dimension = 100 for the input, time_dimension = 5 for the output, and nb_sample = batch_size = 50. Since I have only one time series, my input_dim is 1. If I use model.add(TimeDistributed(Dense(1))), then the output shape will be (50, 100, 1), but it should be (50, 5, 1). For your second suggestion, model.add(TimeDistributed(Dense(input_length))), the output shape will be (50, 100, 100). Again, my output should be (50, 5, 1). The point here is that I can't change time_dimension from 100 to 5!

According to your example I should use model.add(TimeDistributed(Dense(1))), but the problem is that my y_train shape is (nb_sample, 5, 1) while the output shape of that layer is (nb_sample, 100, 1)! So I get an error at the optimization stage because of the shape mismatch. My sequence is a series of real numbers.

fluency03 commented 8 years ago

No. You did not understand my point.

I said that the parameter of TimeDistributed(Dense(parameter)) should be the number of classes (features). It has nothing to do with the timesteps n_nxt and n_prev here.

As to your exact problem, I said I am not sure whether Keras supports input and output with different numbers of timesteps, like (nb_sample, 100, 1) -> (nb_sample, 5, 1). @fchollet, please confirm this or give some suggestions.

However, if you train your model well as (nb_sample, 100, 1) -> (nb_sample, 1, 1), it will still be capable of predicting the next several values, one step at a time.

Or, if you train your model as (nb_sample, 100, 1) -> (nb_sample, 100, 1), then instead of letting X and y have a time difference of one step, what about giving them a time difference of the n_nxt steps you want?
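
One way to build such a shifted dataset (my own sketch, not from the thread; series is a made-up 1-D signal standing in for the real data):

import numpy as np

n_prev, n_nxt = 100, 5
series = np.sin(np.linspace(0, 50, 1000))  # hypothetical univariate time series

X, y = [], []
for t in range(len(series) - n_prev - n_nxt):
    X.append(series[t:t + n_prev])                  # inputs cover steps t .. t+n_prev-1
    y.append(series[t + n_nxt:t + n_nxt + n_prev])  # targets are the same window shifted by n_nxt

X = np.array(X)[..., np.newaxis]  # shape (nb_sample, 100, 1)
y = np.array(y)[..., np.newaxis]  # shape (nb_sample, 100, 1)
print(X.shape, y.shape)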

mina-nik commented 8 years ago

@fluency03 Thanks a lottt. Yes, I misunderstood you :) I think if @fchollet confirms that Keras doesn't support input and output with different timesteps, then the best option for me is to try your last suggestion.

fluency03 commented 8 years ago

Or, have you considered a custom layer here for reshaping the data? Would that be a possible way to solve this problem, @fchollet?

mina-nik commented 8 years ago

Thank you. No, I haven't. I will check it.

carlthome commented 8 years ago

So your current model ends with model.add(TimeDistributed(Dense(n_nxt))), meaning that you apply a dense operation to every input timestep and return n_nxt outputs per timestep. That's not what your target data looks like, right? Read up on encoder-decoder networks. Here's an example stolen from @EderSantana in https://github.com/fchollet/keras/issues/562:

model = Sequential()
model.add(LSTM(inp_dim, out_dim, return_sequences=False))  # Encoder
model.add(RepeatVector(sequence_length))
model.add(LSTM(out_dim, inp_dim, return_sequences=True))  # Decoder
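
With this thread's dimensions, a rough version of the same idea (my own sketch, not the original code; it assumes the Keras >= 1.0 API, and the hidden size of 12 is an arbitrary choice):

from keras.models import Sequential
from keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

n_prev, n_nxt, hidden = 100, 5, 12

model = Sequential()
model.add(LSTM(hidden, input_shape=(n_prev, 1)))  # encoder: (100, 1) -> (hidden,)
model.add(RepeatVector(n_nxt))                    # repeat the summary: (hidden,) -> (5, hidden)
model.add(LSTM(hidden, return_sequences=True))    # decoder: (5, hidden) -> (5, hidden)
model.add(TimeDistributed(Dense(1)))              # per-step readout: (5, hidden) -> (5, 1)
model.compile(loss='mse', optimizer='rmsprop')
# predictions have shape (batch_size, n_nxt, 1), matching the desired (50, 5, 1)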

@fluency03 KTH represent! :smile:

fluency03 commented 8 years ago

@mininaNik the addition_rnn example is indeed a good one and I think it can solve your problem.

Thanks @carlthome ! Cheers, KTH ! :)

mina-nik commented 8 years ago

@fluency03 and @carlthome Thanks a lot, I will check :+1: Cheers :)

mina-nik commented 8 years ago

The output_dim specifies the length of the vector created by the encoder. What is your opinion on this value? Do you know whether a large vector is better, or should it be kept small?

carlthome commented 8 years ago

That's a question for Google Scholar. Typically smaller in autoencoder settings (otherwise what would be the point, right?), but apparently an argument can also be made for a larger hidden layer size as a type of regularization. Search! Also, since this issue has been resolved, please close it.

mina-nik commented 8 years ago

@carlthome You are right. Thank you.

vinayakumarr commented 8 years ago

Anybody have character-based text classification code in Keras, like https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/skflow/text_classification_character_rnn.py?

mina-nik commented 8 years ago

@vinayakumarr Sorry, I don't.

CuriousCat-7 commented 7 years ago

@carlthome Actually I'm not happy with this result, because the encoder-decoder model just joins two LSTMs to work around the problem. That means you use the first LSTM to produce a summary and the second LSTM to decode that summary. But in practice we sometimes need to deal with quite long sequences, and we will lose most of the information if we use a small LSTM to do the summary. In the Neural Turing Machine paper (you can google it), it has been shown that performance drops sharply in some cases once the sequence length goes above 20.

So we still need a way to implement the first many-to-many case.

carlthome commented 7 years ago

Note that the NTM has been superseded by the DNC, but I've heard that both are finicky during training and that a two-layer GRU is a better choice for most applications. Regardless, many-to-many is only straightforward if each training example has a constant ratio of input sequence length to output sequence length; otherwise you need a seq2seq model (the encoder-decoder variant you describe), where any number of input timesteps can be mapped to any number of output timesteps (as in automatic speech recognition).

To avoid compressing all the information in a seq2seq model, it seems beneficial to add some attention mechanism, where you provide additional inputs to the decoder about which input timesteps were important. For example:

[attention mechanism illustration]