keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

LSTM sequence to sequence architectures (Encoding question) #2496

Closed: mina-nik closed this issue 8 years ago

mina-nik commented 8 years ago

I wanted to implement the fourth architecture from the left (the first many-to-many): [image]

In my case the lengths of the input and output aren't equal (I mean the numbers of blue and red units aren't equal). I have n samples to train the network, so the input shape is (n, n_prev, 1) and the output shape is (n, n_nxt, 1). I want the network to be stateful. I understood how to do it with the help of @carlthome and others in issue #2403. This is my final code:

    from keras.models import Sequential
    from keras.layers import Dense, LSTM, RepeatVector, TimeDistributed

    # batch_size, n_prev and n_nxt come from the data preparation (not shown here)
    hidden_neurons = 50
    model = Sequential()
    ## Encoder: read the n_prev input steps and keep only the final state
    model.add(LSTM(hidden_neurons,
                   batch_input_shape=(batch_size, n_prev, 1),
                   forget_bias_init='one',
                   return_sequences=False,
                   stateful=True))
    model.add(Dense(hidden_neurons))
    ## Repeat the encoded summary once for every output step
    model.add(RepeatVector(n_nxt))
    ## Decoder: produce one output vector per n_nxt step
    ## (batch_input_shape is only needed on the first layer, so it is dropped here)
    model.add(LSTM(hidden_neurons,
                   forget_bias_init='one',
                   return_sequences=True,
                   stateful=True))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='rmsprop')
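
For reference, this is roughly how the shapes flow through the model above; the concrete numbers (batch_size=32, n_prev=10, n_nxt=5) are placeholders chosen only for illustration:

    # input                              -> (32, 10, 1)    (batch_size, n_prev, 1)
    # LSTM(50, return_sequences=False)   -> (32, 50)        encoder: final state only
    # Dense(50)                          -> (32, 50)
    # RepeatVector(5)                    -> (32, 5, 50)     summary repeated n_nxt times
    # LSTM(50, return_sequences=True)    -> (32, 5, 50)     decoder: one vector per output step
    # TimeDistributed(Dense(1))          -> (32, 5, 1)      (batch_size, n_nxt, 1)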

Now I am unsure whether my encoder works like fig. 1 or fig. 2.

[image: fig. 1]

[image: fig. 2]

fchollet commented 8 years ago

Your model is basically fig. 2 stacked on top of fig. 1.

braingineer commented 8 years ago

Agreed, and @mininaNik's model also produces a Y at every timestep because of model.add(TimeDistributed(Dense(1))).

mina-nik commented 8 years ago

Thank you.

mina-nik commented 8 years ago

@fchollet I am still unsure about something and would be grateful for your help. Are the hidden neurons connected to each other within each LSTM layer? I don't understand how the feedback connections are managed here. Since the whole sequence is fed to all hidden neurons in the first layer, I think there shouldn't be connections between neurons within a layer. When I checked the shapes of the inputs and outputs of the layers, I concluded the following (maybe I am wrong!): when my input shape is (nb_samples, timesteps, input_dim), the whole 3D input tensor is fed to all hidden_neurons in my first LSTM layer, and then, since return_sequences=False, we will have only one output from the last neuron in this layer (the output shape will be a 2D tensor with shape (nb_samples, output_dim), where output_dim = hidden_neurons in my case).

braingineer commented 8 years ago

not fchollet, but going to jump on it:

Are hidden neurons connected to each other within each LSTM layer?

I'm not sure what you mean here. So I'm going to just explain how the RNN code works in a general way.

With an RNN, there is an internal state. That state is updated at every time step with the value of its last hidden state and the new input.

If you follow the RNN code to the Theano backend (keras.backend.theano_backend), you'll find a def rnn function. Inside it, there are four cases: mask/no mask and unroll/don't unroll.

The simplest to follow and understand is the no mask and don't unroll case. https://github.com/fchollet/keras/blob/master/keras/backend/theano_backend.py#L671

In this implementation, the inputs have been permuted so they went from shape (nb_samples, timesteps, input_dim) to (timesteps, nb_samples, input_dim). These are passed into theano.scan, which performs what is akin to a for loop over the 0th dimension (this is why timesteps was put on the 0th dimension).

The _step function here describes what happens inside the loop. This is where the hidden state is updated.

theano.scan returns the outputs from each timestep, concatenated. You can see that the last output is the last item in this concatenation. This is what is sent out when return_sequences=False.
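
To make that concrete, here is a minimal NumPy sketch of the kind of loop theano.scan performs, written for a plain RNN step rather than Keras' actual LSTM _step (which also carries a cell state and gates); all names here are illustrative:

    import numpy as np

    def simple_rnn(x, W_x, W_h, b, h0):
        """x has already been permuted to (timesteps, nb_samples, input_dim)."""
        h = h0                                    # hidden state, (nb_samples, output_dim)
        outputs = []
        for x_t in x:                             # the scan: a loop over the 0th (time) dimension
            h = np.tanh(x_t @ W_x + h @ W_h + b)  # the _step: new state from input + previous state
            outputs.append(h)
        outputs = np.stack(outputs)               # (timesteps, nb_samples, output_dim)
        return outputs[-1], outputs               # last output vs. full sequence

return_sequences=False corresponds to returning only outputs[-1]; return_sequences=True returns the whole stack.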

carlthome commented 8 years ago

@mininaNik, are you asking whether Keras' LSTM looks like this [image]

or more like this [image]? I think the first one.

I wonder if, without TimeDistributed(Dense(...)), this isn't more what you'd get, by the way: [image]

mina-nik commented 8 years ago

@carlthome :+1: You are right; you understood my question precisely. If we consider the input shape for an LSTM layer to be (nb_samples, timesteps, input_dim), then input_dim is the number of Xs (the number of input nodes) in the input layer, and each input (each X in the input layer) is a vector (sequence) of length timesteps. Am I right?

A. In Keras, when return_sequences=False: if the input shape of the LSTM layer is (nb_samples, timesteps, input_dim) and the number of neurons in this layer equals hidden_neurons, then the output shape of this layer will be (nb_samples, hidden_neurons), which means we have only the last output for the whole sequence from each LSTM neuron.

B. In Keras, when return_sequences=True: if the input shape of the LSTM layer is (nb_samples, timesteps, input_dim) and the number of neurons in this layer equals hidden_neurons, then the output shape of this layer will be (nb_samples, timesteps, hidden_neurons), which means we have the full sequence as output from each LSTM neuron.
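
As a quick sanity check of A and B, a minimal sketch with made-up sizes (written with the bare LSTM signature, not the exact arguments from the model above); the batch dimension shows up as None because it isn't fixed here:

    from keras.models import Sequential
    from keras.layers import LSTM

    timesteps, input_dim, hidden_neurons = 10, 1, 50

    # A: return_sequences=False -> only the last output per sequence
    m1 = Sequential()
    m1.add(LSTM(hidden_neurons, input_shape=(timesteps, input_dim), return_sequences=False))
    print(m1.output_shape)  # (None, 50), i.e. (nb_samples, hidden_neurons)

    # B: return_sequences=True -> one output per timestep
    m2 = Sequential()
    m2.add(LSTM(hidden_neurons, input_shape=(timesteps, input_dim), return_sequences=True))
    print(m2.output_shape)  # (None, 10, 50), i.e. (nb_samples, timesteps, hidden_neurons)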

By considering A and B, I can completely understand why, when we want to predict the next n_nxt steps from the previous n_prev steps (where n_prev != n_nxt), we have to use an encoder-decoder architecture. Here my question is: is it a fundamental problem for RNNs, or is it a problem of some libraries, that we can't have different time dimensions for input and output?

Thank you @braingineer for your good reference.

braingineer commented 8 years ago

which means we have only the last output for the whole sequence from each LSTM neuron

Almost. You have the last output from the last LSTM cell. You can think of it as slicing off the end of a list: some_list[-1]

we have the full sequence as output from each LSTM neuron

I think using the word 'neuron' is really throwing you off here (or it's just throwing me off). The layer outputs the full sequence. That just means it doesn't return only the last slice of the time dimension. Each time step is responsible for its own calculations.

Using your example, (nb_samples, timesteps, input_dim) is processed by the RNN at every time step by turning a matrix of size (nb_samples, input_dim) into a matrix of size (nb_samples, output_dim). Each time step is handled one at a time. The recurrence relationship just says that the matrix multiplication that produces (nb_samples, output_dim) at time t depends not only on the values of the (nb_samples, input_dim) matrix at time t, but also on the result of the matrix calculation at t-1.

So, it might be better to think about it as just a series of input matrices, each performing a single matrix calculation. As they do it in succession, the previous matrix calculation passes its result onward so that the next matrix calculation can include it in its result.

Is it a fundamental problem for RNNs, or is it a problem of some libraries, that we can't have different time dimensions for input and output?

I don't think I understand as well as @carlthome does, so I'm confused by what you mean here again.

I think you need to read up more on the nature of sequence modeling in general. Markov chains were the first statistical sequence models. They were used to try to describe a series of discrete behaviors. There is a good visualization here: http://setosa.io/ev/markov-chains/. Now, these have some self-connections, which make the diagram easier to draw, but a beginner thought exercise is:

You are a doorman at a building. You can never go outside, but you can see people walking into the building every day. Each day you wonder if it is raining or if it is sunny. Every day you get to see people either carrying umbrellas or not carrying umbrellas.

Now, as the doorman, you want to calculate the state of the weather. This is known as a latent state or hidden node because it is never observed. The observations of umbrellas are your inputs every day. These are what get fed into the sequence model. The transitions are what occur in the middle part (the middle part of the RNN). In Markov chains, this was the most important part. The goal was to model how things changed and use evidence to better model that.

So, at time t, it is Monday and you discover it is sunny. This is one timestep. At time t+1, you do not know what the weather is, but you know that P(sunny today | sunny yesterday) = 0.8, so you can give a pretty good guess. The weather of yesterday influencing the weather of today is the self-connection in the RNN.
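
A tiny numeric sketch of that update; only the 0.8 comes from the example above, the other numbers are made up:

    import numpy as np

    # states: [sunny, rainy]; transition[i, j] = P(state j today | state i yesterday)
    transition = np.array([[0.8, 0.2],
                           [0.4, 0.6]])

    belief = np.array([1.0, 0.0])   # Monday: you discovered it is sunny
    belief = belief @ transition    # Tuesday's guess, before seeing any umbrellas
    print(belief)                   # [0.8, 0.2] -> still a "pretty good guess" that it is sunny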

The key observation that neural network researchers had was that you lose information when you exit the continuous energy space and make discrete judgements. There are a lot more technical intuitions here, but that's the basic idea. It's why Markov models are not as good at modeling sequences as conditional random fields, which in turn aren't as good at modeling sequences as RNNs.

So, basically, what you are asking is whether you, as the doorman, should return all of your guesses at every timestep, or just your last guess. That's all return_sequences does. Each day you are the same doorman, each day you get new evidence, and you know how the hidden state changes over time. That's the bare essence of sequence modeling.

Edit: sorry for rambling or if this doesn't make sense. Too much time without sleep.

Edit 2: whoops, forgot about your last question. Now, given the context of everything I've said above:

Here my question is: is it a fundamental problem for RNNs, or is it a problem of some libraries, that we can't have different time dimensions for input and output?

This is known in the classic literature as filtering. It's not a problem for any sequence model to make forward predictions; the issue is making good forward predictions. That is a topic of research (or engineering, depending on who you ask).

mina-nik commented 8 years ago

Thanks a lot @braingineer :+1: I should read your references carefully.

I don't think I understand as well as @carlthome does, so I'm confused by what you mean here again.

If you read my issue #2403 you can find what I mean. I wanted to do sequence-to-sequence mapping with an RNN, where the lengths of the input and output sequences were fixed but not equal. I found that the only way to change the time dimension (since the lengths of the input and output sequences are different) is to use an encoder-decoder network. I wanted to know whether this is the same for all RNNs in all libraries (i.e., I would have to use an encoder-decoder in every library for my case) and has a theoretical reason, or whether it is only a Keras problem. From your comments today I think it has a theoretical reason. I should read more :)

carlthome commented 8 years ago

If we consider the input shape for an LSTM layer to be (nb_samples, timesteps, input_dim), then input_dim is the number of Xs (the number of input nodes) in the input layer, and each input (X) is a vector (sequence) of length timesteps. Am I right?

input_dim corresponds to the number of neurons in the input, yes. A sequence consists of timesteps many such vectors, forming a matrix. I'm unsure precisely how Keras manages to take the output neurons of the previous timestep and feed them back with the next input vector, but I'd imagine theano.scan (or the equivalent TensorFlow function) in the backend takes care of this, as @braingineer explained.

Is it a fundamental problem for RNNs, or is it a problem of some libraries, that we can't have different time dimensions for input and output?

Each input vector must result in an output vector. In theory you could certainly do something more exotic; it's just multiplication of scalars, after all. However, it would probably not make much sense. ANNs fit a function f(x) = g(y), where g is some hidden function that we don't know but want to approximate by providing examples of input and output values. It wouldn't make sense to have functions that sometimes don't produce any output despite receiving valid input; in fact, it's undefined behavior.

For just predicting several timesteps ahead, though, an encoder-decoder is not needed. You could either: a) append each predicted output vector to its input sequence and feed the resulting new input sequence back into the network, rinse and repeat; or b) set stateful=True and do precisely the same thing, but instead of feeding the entire concatenated sequence back into the model each time, you'd just feed the output vector back into the model. Preferably you'd set timesteps=1 to get the performance benefits (otherwise you'd need to pad and mask dummy input vectors for implementation reasons), but then there's the cost of not being able to perform BPTT properly (which doesn't seem that problematic in practice).
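
A rough sketch of option (b), assuming a stateful one-step-ahead model (timesteps=1, a single feature, batch_size=1); model, history and n_nxt are placeholders, and this warm-up/feedback loop is just one way to wire it up:

    import numpy as np

    model.reset_states()                  # start from a clean hidden state

    # 1) warm up the state on the known history, one timestep at a time
    for x_t in history:                   # history: the n_prev observed values
        last = model.predict(np.array(x_t).reshape(1, 1, 1), batch_size=1)

    # 2) feed each prediction back in as the next input
    predictions = []
    for _ in range(n_nxt):
        last = model.predict(last.reshape(1, 1, 1), batch_size=1)
        predictions.append(last.ravel()[0])

Option (a) is the same idea, except that you append each prediction to the input window and re-run the whole (growing) sequence through the network each time.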

mina-nik commented 8 years ago

@carlthome Thanks a lot :+1:

iwinterlovermy commented 8 years ago

Hi @carlthome ,

I tried to do multiple-step prediction following your advice. I tried modifying the stateful_lstm.py example to do so, but I got weird results. I don't completely understand how to implement your advice for multiple-step prediction in code; I wrote it as below. Can you advise me on the right way to do the prediction?

The prediction code I wrote is below:

    for i in range(10):
        prediction_output = model.predict(predict_test_output_t, batch_size=batch_size)
        predict_test_output_t = prediction_output

Thanks in advance

carlthome commented 8 years ago

@iwinterlovermy,

Sure, but first:

  1. Make a separate issue (or better yet, go here: https://groups.google.com/forum/#!forum/keras-users).
  2. Also, please fix the markdown code formatting.

iwinterlovermy commented 8 years ago

Hi @carlthome,

Noted. Sorry for the confusion. I don't have experience posting issues on GitHub.

Thanks!

iwinterlovermy commented 8 years ago

Hi @carlthome,

I've created the topic at https://groups.google.com/forum/#!topic/keras-users/OLL98nhaJkU

Thanks in advance for your help

iwinterlovermy commented 8 years ago

Hi @carlthome ,

Referring to your suggested method for multiple-step prediction using stateful=True, may I know what you mean by feeding only the output vector back into the model? Referring to stateful_lstm.py, where model.predict requires a batch size of 25, does that mean the output vector is predicted[-25:,], i.e. the last 25 (batch size) records, which are fed into the next prediction?

Thanks in advance!