keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

LSTM fully connected architecture #4149

Closed carrasRuf closed 7 years ago

carrasRuf commented 8 years ago

Hi everyone. First, I would like to express my gratitude to all the people who work daily to improve the Keras software and its documentation. Thank you guys ;)

After reading many posts trying to sort out all my questions, I still have some doubts related to LSTM recurrent networks, so I hope you can help me.

Input shape: (nb_samples, timesteps, input_dim). I have 11200 samples, each sample contains 3000 timesteps, and each timestep contains 22 values. Therefore, my input shape is (11200, 3000, 22).

Output: every sample must be classified into one class ('0' or '1').

Goal: classify every sample into one of the two possible classes ('0' or '1') using an LSTM fully-connected network.

Architecture to follow:

[architecture diagram: an LSTM layer feeding a fully connected output layer]

In the following posts I found very useful information related to my problem: #2673 and #2496. However, I still have many doubts:

  1. As far as I know, an LSTM layer at the beginning of the model is not fully connected, as @carlthome and @fchollet explained in #2673.
  2. Since the goal is to classify each sample into one class ('0' or '1'), TimeDistributed(Dense(...)) shouldn't be used because, as far as I know, that layer produces an output per timestep, whereas I want to classify each sample as a whole into class '0' or '1'.
  3. In this simple architecture there is only one LSTM layer, so return_sequences doesn't matter. However, with two LSTM layers, should return_sequences be True or False? I think in my model the first layer should have return_sequences=True, as explained in #2496, but I'm not quite sure about it.

Let's start with a first approach to the model (although I know it is wrong).

from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps = 3000
input_dim = 22  # number of features per timestep

model = Sequential()
# One LSTM layer; by default only the output at the last timestep is returned
model.add(LSTM(22, input_shape=(timesteps, input_dim)))
# Sigmoid output for binary classification of the whole sequence
model.add(Dense(1, activation='sigmoid'))
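
For completeness, a minimal sketch of how this model could be compiled and trained on data shaped as described above (assuming a recent Keras version; X_train and y_train are placeholder names, and the optimizer and hyperparameters are arbitrary):

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# X_train: float array of shape (11200, 3000, 22); y_train: 0/1 labels of shape (11200,)
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1)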

Can anyone help me build my model and resolve all my questions? Thank you very much in advance!

carlthome commented 7 years ago

Actually, looking at https://github.com/fchollet/keras/issues/2673 again, and at the LSTM equations in Keras, I think there might have been some more confusion: it sure looks like the inputs are fully connected at each time step, due to the matrix multiplication of the input vector x with the weight matrix W.

In other words, the model you posted should be correct (though I don't get why the output layer in the diagram has arrows going back into the hidden layer; is it just backprop?).

carrasRuf commented 7 years ago

Thanks for your fast answer, @carlthome! You are right: looking at the LSTM equations, the inputs do seem to be fully connected at each time step. Great! :)

Regarding your question about the arrows going back into the hidden layer, you are right, it is just backprop. I've seen in many articles that it performs very well, so I was thinking about using it. Taking a look at the code, I've seen that the Recurrent class contains a flag (go_backwards) which allows processing the sequence backwards (or at least I think so), but not the LSTM class. Any idea? I might use a BLSTM #1629 in order to achieve that...
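
For reference, a minimal sketch of a bidirectional setup using the Bidirectional wrapper (assuming a Keras version that provides keras.layers.Bidirectional; note that the go_backwards flag alone only reverses the order in which a single LSTM reads the input):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional

model = Sequential()
# Two LSTMs, one reading the sequence forwards and one backwards; their outputs are merged
model.add(Bidirectional(LSTM(22), input_shape=(3000, 22)))
model.add(Dense(1, activation='sigmoid'))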

Another question, @carlthome: I think the output of each neuron inside the LSTM layer (in my case 22 neurons, according to the code posted) is connected only to itself and to the output layer, not to the 21 remaining neurons of the same LSTM layer. Am I right? So, my model looks like:

[diagram: each LSTM unit connected only to itself across timesteps and to the output layer]

And NOT like:

[diagram: each LSTM unit also connected to the other units of the same layer]

Therefore, and correct me if I'm wrong, the code for two LSTM layers would look like this:

model = Sequential()
# First LSTM returns the full sequence so the second LSTM receives one vector per timestep
model.add(LSTM(22, return_sequences=True, input_shape=(timesteps, input_dim)))
# Second LSTM returns only its final output
model.add(LSTM(3))
model.add(Dense(1, activation='sigmoid'))

[diagram of the corresponding two-layer LSTM model]

Right? Thank you, @carlthome and everyone else, in advance for all your help.

JuniorIng commented 7 years ago

Hi, I think you can find the answer in this paper: "A Critical Review of Recurrent Neural Networks for Sequence Learning", Lipton, Berkowitz

"The output from each memory cell flows in the subsequent time step to the input node and all gates of each memory cell."

So each memory cell is connected to every memory cell in the layer (across timesteps), and the implementation in Keras should behave as you expect. Anyway, I am starting with LSTMs as well, and that paper was a great introduction to the topic...

carlthome commented 7 years ago

@carrasRuf, right. LSTM cells in the same layer are not connected to each other AFAIK.

JuniorIng commented 7 years ago

@carlthome, are you sure about that? The paper cited above describes lateral connections between LSTM cells in one layer across timesteps. Maybe one of the developers knows about the implementation in Keras?

carlthome commented 7 years ago

Linear algebra is not my favorite, but it sure looks to me like each sample is multiplied with each weight, and that's it. The input gate for example: x_i = K.dot(x * B_W[0], self.W_i) + self.b_i. @JuniorIng, what do you think? FYI B_W is for RNN dropout as an elementwise binary mask.
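
A minimal numpy sketch of the shapes involved in that line (the names and sizes are made up for illustration, and B_W is replaced by ones, i.e. no dropout):

import numpy as np

input_dim, units = 22, 22
x = np.random.rand(input_dim)            # input vector at one timestep
W_i = np.random.rand(input_dim, units)   # input-to-hidden weights of the input gate
b_i = np.zeros(units)

# Every input feature contributes to every unit's input gate: fully connected input-to-hidden
x_i = np.dot(x, W_i) + b_i
print(x_i.shape)  # (22,)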

JuniorIng commented 7 years ago

Hmm, good point... but have a look at the SimpleRNN. It is described as fully connected, and its state is calculated very similarly to the input gate of the LSTM:

h = K.dot(x * B_W, self.W) + self.b
output = self.activation(h + K.dot(prev_output * B_U, self.U))

So maybe x includes the outputs of the other nodes of the layer?! Well, I am not sure about this, but I can see no reason why the implementation in Keras should be different from the standard model.
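
One way to check is to inspect the weight shapes of an LSTM layer directly. A minimal sketch, assuming a recent Keras version where the per-gate matrices are concatenated (layer sizes are arbitrary):

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(4, input_shape=(10, 3)))  # 4 units, 10 timesteps, 3 features

kernel, recurrent_kernel, bias = model.layers[0].get_weights()
print(kernel.shape)            # (3, 16): input_dim x 4*units, input-to-hidden weights for the four gates
print(recurrent_kernel.shape)  # (4, 16): units x 4*units, so every unit's previous output feeds every gate of every unit
print(bias.shape)              # (16,)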

zackchase commented 7 years ago

Hi. Hope I can help. Let's consider a single layer h1 at time step 1. The output from h1 at time step 1 feeds into h1 at time step 2. It is precisely this connection, from a layer at one time step to the same layer at the subsequent time step that makes it a recurrent neural network. This is true both for LSTMs and for simple RNNs. The difference with the LSTM is that instead of having simple sigmoid units you have a more complicated structure (a memory cell).

Note, in modern implementations, the LSTM layers are usually "stacked". In this case, h1 (at time 1) feeds into h1 at time 2 (forward in time) AND also into h2 at time 1 (up the stack).

Hope this helps.

carlthome commented 7 years ago

No, that doesn't help.

aisopos commented 7 years ago

hi folks, was there a consensus regarding a layer being fully connected or not? In a single layer, is the output of each cell an input to all other cells (of the same layer) or not?

carlthome commented 7 years ago

They are fully connected, both input-to-hidden and hidden-to-hidden. You have batch_size many cell states. They are never mixed across samples, but the weights are shared.

aisopos commented 7 years ago

Great, thanks a bunch carlthome for clarifying this :)

But why do I have batch_size many cells? I thought that, if it's an input layer with shape (batch_size, timesteps, input_dim), the number of cells is input_dim, while if it's an intermediate layer, e.g. defined by model.add(LSTM(x)), the number of cells is x. Is this correct?

Thanks again :)

carlthome commented 7 years ago

For each input vector (at any given time step) you get one output vector and one cell state vector. Both the cell state vector and the output vector are used in the next time step together with the next input (for the first time step they are all zeroes). Thus you'll get as many cell states as input vectors in a batch.

Each cell state vector consists of output_dim many values, as does the output vector. The input vector could have fewer (or more) values, because an input-to-hidden matrix multiplication is always performed first, at each time step, before calculating the cell state and output.
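
A minimal sketch to make these shapes concrete, assuming the functional API and a Keras version that supports return_state (all sizes are arbitrary):

from keras.models import Model
from keras.layers import Input, LSTM
import numpy as np

inputs = Input(shape=(5, 3))                       # 5 timesteps, 3 features per step
outputs, state_h, state_c = LSTM(4, return_state=True)(inputs)
model = Model(inputs, [outputs, state_h, state_c])

x = np.random.rand(8, 5, 3)                        # a batch of 8 sequences
out, h, c = model.predict(x)
print(out.shape, h.shape, c.shape)                 # (8, 4) (8, 4) (8, 4): one output and one cell state vector per sample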

aisopos commented 7 years ago

Thanks again carlthome. A follow-up, potentially newbie question, to make sure I understood it right:

Assume a single-layer LSTM as follows: batch size = 100, time steps = 10, inputs = 3, outputs = 2, one hidden LSTM layer with 6 nodes, and one Dense layer with 2 nodes (= outputs).

...for each time step (1 to 10) we provide an input_vector[inputs][batch size] = [3][100] -> size is 300

...and we get an output_vector[outputs][batch size] = [2][100] -> size is 200, plus a cell_state_vector[number of LSTM nodes? or number of output nodes?][100]

...which two we feed to the next time step, together with a new input_vector. Correct?

carlthome commented 7 years ago

No. I think you should review the LSTM equations, and watch this before you proceed. Have you studied linear algebra before? If not, I propose you take a course on the subject.

carlthome commented 7 years ago

I'd also like to add, looking back at my own confusion last year, that most of these graphs are not helpful at all in understanding what's actually happening. It's a lot easier to go by the equations instead.
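
For reference, one common formulation of the LSTM equations (W are the input-to-hidden matrices, U the hidden-to-hidden matrices, and \odot is elementwise multiplication; Keras implements essentially this form, with dropout masks omitted here):

\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}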

aisopos commented 7 years ago

OK, will go through these, thanks again :)

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

nxthuan512 commented 6 years ago

Hi, I'd like to confirm the drawing from @carrasRuf.

I'm also working with a 2-layer LSTM, using the Keras code below. L1 and L2 are the number of LSTM units in layer 1 and layer 2, respectively. The number of timesteps is T and the number of features in each timestep is 16.

model = Sequential()
model.add(LSTM(L1, return_sequences=True, input_shape=(T, 16)))
model.add(LSTM(L2, return_sequences=True))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))

Since I am using TimeDistributed, is the network below equivalent to the code?

[diagram of the proposed two-layer LSTM network]

If we define the input dimensions as (n_sample, T, 16), the input to LSTM layer 1 is (, T, 16) and the input to LSTM layer 2 is (, T, L1). Finally, the input to the output layer is (, T, L2).

Please let me know whether the drawing is correct. Thanks.
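
A minimal sketch to check those per-layer shapes by inspecting the built model (the values of L1, L2, and T are picked arbitrarily here):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

L1, L2, T = 32, 16, 100

model = Sequential()
model.add(LSTM(L1, return_sequences=True, input_shape=(T, 16)))
model.add(LSTM(L2, return_sequences=True))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))

for layer in model.layers:
    print(layer.name, layer.output_shape)
# Expected: (None, 100, 32), (None, 100, 16), (None, 100, 1)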

kasuteru commented 5 years ago


@nxthuan512, were you able to evaluate whether your schema is correct?