keras-team / keras


Simple stateful LSTM example #6168

Closed · volvador closed this issue 3 years ago

volvador commented 7 years ago

Please consider this simple example:

import numpy as np

nb_samples = 100000
X = np.random.randn(nb_samples)
# Build a pure lag-1 task: the target at step i is the input at step i-1.
Y = X[:-1]
X = X[1:]
X = X.reshape((len(Y), 1, 1))
Y = Y.reshape((len(Y), 1))

So we basically have

Y[i] = X[i-1]

and the model is simply a lag operator.

Now I try to learn this mapping with a stateful LSTM, by feeding the (x, y) pairs one at a time (batch_size = 1):

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(10,
               batch_input_shape=(1, 1, 1),
               activation='tanh',
               stateful=True))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')

for epoch in range(10000):
    model.reset_states()        # start each epoch with a fresh hidden state
    train_loss = 0.0
    for i in range(Y.shape[0]):
        # one (x, y) pair per update, keeping the LSTM state between updates
        train_loss += model.train_on_batch(X[i:i+1], Y[i:i+1])
    print('# epoch', epoch, ' loss ', train_loss / Y.shape[0])

but I am seeing a mean loss around 1, which is the variance of my randomly generated data, so the model does not seem to be learning anything.

Am I doing something wrong?

bstriner commented 7 years ago

All LSTMs are stateful. Keras stateful only means stateful between batches.

The problem is that gradients cannot backpropagate between batches. So you are using the state from time t at time t+1, but those are two separate batches, and the gradient cannot flow back to the hidden representation at t.

An LSTM can only learn dependencies effectively within a batch. Stateful can hypothetically learn something between batches, but don't depend on it. Batch t+1 will do the best it can with hidden state t, but hidden state t gets no gradient pushing it to be a useful representation for batch t+1.

If you modify your code to make batches of 2 instead of 1, it should start working.

Cheers

volvador commented 7 years ago

Thanks Ben for this answer. Please bear with me on this example and correct my misunderstandings.

1- I tried with batch_size = 2, and even with batch_size = 20, and got the same result: basically, the model is not learning anything.

2- My understanding is that whatever batch_size I use, the model will not see that y(t) depends on x(t-1). Indeed, since in every batch I am giving the pairs (x(t), y(t)), backpropagation through time will only compute the derivative of the loss with respect to x(t). The batch size is there only to estimate that derivative over a large number of samples and then take the mean. So whatever batch_size I use, I will not see the dependence on x(t-1). The only way I can see it is to set nb_steps (the length of the window of the x sequence) to a number greater than 1.

So the LSTM does not learn dependencies across the rows of a batch: every row of the batch is treated independently, and the longer the batch, the better the gradient estimate. Dependencies seem to be learned within the window of x values given to the model (nb_steps in Keras nomenclature, i.e. the second entry of batch_input_shape = (..., ..., ...)).

What I hoped from my experiment was that the network would learn to simply store the input x[t] in its hidden state, then at t+1 return that hidden state as the output y[t+1], replace the hidden state with x[t+1], and continue like that recursively. Obviously, I can achieve this with a stateless LSTM and nb_steps = 2 or larger, but I wanted to get the result with a stateful one.

If I am right, it seems an LSTM (or any other recurrent neural network) can only learn dependencies as far back as the length of the input sequence (nb_steps in Keras nomenclature). So if I am dealing with a time series where y(t) depends on x(t-101), and the moving window of x values I feed to the model happens to be only 100 long, then the model will not learn anything.

Please correct me

bstriner commented 7 years ago

Yup. I meant batches with depth 2 instead of 1. Just realized how vague I was.

Backprop in an LSTM happens only within a batch and goes back the depth (number of timesteps) of that batch. If your depth is more than 1, it should work.

With depth one, the top level can be trained, but it won't backprop to the previous layers.
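
Concretely, something like this untested sketch (Keras 2-style API assumed, stateless for simplicity, shapes chosen for illustration) is what I mean by depth 2: each sample carries two consecutive timesteps, so the lag-1 dependency sits inside a single sequence and BPTT can reach it.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

A = np.random.randn(10000)
# Sample i is the length-2 sequence [A[i], A[i+1]]; the target is A[i],
# i.e. the input one step before the last timestep of the sequence.
X = np.stack([A[:-1], A[1:]], axis=1).reshape(-1, 2, 1)
Y = A[:-1].reshape(-1, 1)

model = Sequential()
model.add(LSTM(10, input_shape=(2, 1), activation='tanh'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X, Y, batch_size=32, epochs=5)  # the loss should drop well below 1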

skjerns commented 7 years ago

Coming back to the example of volvador:

So if I am dealing with time series where y(t) depends on x(t-101) and the batch-depth (seq. length) is 100

In this case the function will not be able to learn our example, right?

So stateful=True will just give me a light version of recurrence?

bstriner commented 7 years ago

If your sequence length is 100 and the only dependency is 101 steps away, then the model will not learn that dependency.

If there is also another dependency 100 away, the model will learn to encode that dependency into the hidden state. If you run a stateful LSTM, the model will try to predict 101 from the hidden state at 100, which might still have some useful information.

The thing to keep in mind is that information travels through backprop, which is only working within a single batch.

Also, keep in mind this is largely hypothetical, and an LSTM is probably not going to learn much with a depth of 100. It gets harder and harder to learn longer dependencies. LSTMs will learn longer dependencies than simple RNNs, but not infinitely long ones. You could read up on highway networks or multi-timescale learning if 100 was not just a figure of speech.
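
Roughly, the data construction decides what is learnable: the window has to be long enough that x(t-lag) actually appears in it. An untested sketch (make_windows is just an illustrative helper, not a Keras utility):

import numpy as np

def make_windows(series, lag, seq_len):
    # Each sample is a window of seq_len consecutive inputs; the target is
    # the input `lag` steps before the window's last element.
    X, Y = [], []
    for end in range(seq_len + lag, len(series)):
        X.append(series[end - seq_len:end])
        Y.append(series[end - 1 - lag])
    return np.array(X)[..., None], np.array(Y)[:, None]

series = np.random.randn(5000)
X_ok, Y_ok = make_windows(series, lag=10, seq_len=20)    # x(t-10) is inside the window: learnable
X_no, Y_no = make_windows(series, lag=101, seq_len=100)  # x(t-101) never appears in the window: hopeless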

Cheers

skjerns commented 7 years ago

Thanks a lot, that makes it more understandable.

My dependencies are actually not that long (though not precisely known); my problem is rather that my dimensionality is large, and a sliding-window approach would significantly increase training time. I thought that with a stateful RNN I could avoid feeding overlapping slice after slice to the network. Turns out I can't.

bstriner commented 7 years ago

No free lunch. You should have sequences as long as whatever dependency you're trying to learn, and at least a handful of sequences per batch. Good luck!

Cheers

akash13singh commented 7 years ago

@bstriner: I am confused by the use of the term "batch". Do you mean "one sample in a batch" when you say "batch" here: "Backprop of an LSTM is only within the batch and goes back the depth of the batch. If your depth is more than 1 it should work"?

As per my understanding, BPTT is only done over an individual sample within a batch, rather than across the entire batch. Each sample is formed by concatenating l consecutive time steps together, so l determines the length of BPTT. And there is no way to backprop across different samples, or to preserve state from one sample to another within a batch. Am I correct?

Thanks

bstriner commented 7 years ago

Don't get too hung up on the language. In audio processing there are many samples per sequence. In text processing a sample is a sequence.

I like to think of a batch as being made of samples. Each sample is independent and the independent gradients for samples are combined. If you backprop between samples, you kind of only have one sample.

Then again, if you use something like batchnorm, there is a gradient between sample 0 at time 0 and sample n at time t. So maybe we should think of those samples as part of a meta-sample, because the gradients aren't independent, as they are in traditional batches.

And there is no way to backprop over different samples or preserve state from one sample to another in a batch. Am I correct?

If you could preserve state between samples in a batch then wouldn't you just have one sample?

Just remember that a standard LSTM input should have 3 dimensions (batch, sequence, feature) and will iterate over the middle dimension in each batch.
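
A quick shape check of that convention (untested sketch, Keras 2-style API assumed):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

x = np.random.randn(4, 7, 3)  # 4 independent samples, 7 timesteps, 3 features each

model = Sequential()
model.add(LSTM(16, input_shape=(7, 3)))  # steps over the middle (time) dimension
model.compile(optimizer='adam', loss='mse')
print(model.predict(x).shape)  # (4, 16): last hidden output for each sample

seq = Sequential()
seq.add(LSTM(16, input_shape=(7, 3), return_sequences=True))
seq.compile(optimizer='adam', loss='mse')
print(seq.predict(x).shape)  # (4, 7, 16): one output per timestep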

Good luck!

akash13singh commented 7 years ago

Yes, I read too much and got all tied up. But this clears it up. Thanks!

boubiou commented 7 years ago

Hi all, just to be completely clear, @bstriner: when you say "Just remember that a standard LSTM input should have 3 dimensions (batch, sequence, feature) and will iterate over the middle dimension in each batch",

is this also true when using the option stateful = True?

If that is the case, I am not sure I understand: we propagate the state across batches AND do backprop on elements within a batch that are supposed to be independent?

To quote fchollet in https://github.com/fchollet/keras/issues/98: "... or should it consider that the samples in a batch are independent, but that the next batch will provide the samples that come chronologically next (i.e. batch_2[i] is the successor to batch_1[i] for all i)?

Let's go with this behavior, and let's implement it as an option in existing recurrent layers (stateful keyword argument in constructor, False by default). I believe this can be easily achieved by simply storing the last output and last memory at the end of a batch (e.g. as class attributes of the layer?), then passing these as outputs_info in the scan loop of the next batch. Remarks, concerns?"
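
For concreteness, here is how I understand that layout, as a rough untested sketch (stateful_batches is my own illustrative helper, not a Keras utility): the long series is cut into batch_size parallel streams, and batch k+1 simply continues each stream where batch k stopped.

import numpy as np

def stateful_batches(series, batch_size, seq_len):
    n_steps = len(series) // batch_size
    # batch_size contiguous streams; row i of every batch comes from stream i,
    # so batch k+1 provides the chronological successors of batch k.
    streams = series[:n_steps * batch_size].reshape(batch_size, n_steps)
    for start in range(0, n_steps - seq_len, seq_len):
        x = streams[:, start:start + seq_len, None]          # (batch, seq_len, 1)
        y = streams[:, start + 1:start + seq_len + 1, None]  # next-step targets
        yield x, y

# Training sketch with a hypothetical stateful `model`
# (batch_input_shape=(4, 50, 1), stateful=True):
# for epoch in range(n_epochs):
#     model.reset_states()
#     for x, y in stateful_batches(series, batch_size=4, seq_len=50):
#         model.train_on_batch(x, y)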

Thank you for your time, I am a bit lost.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

mik1904 commented 6 years ago

I want to use an LSTM to build a one-step-ahead predictor. There are a lot of examples online that use a predefined "time window" to train a stateless LSTM network, so I will give an example to make it clearer:

time window size = 5
Dataset = [1,2,3,4,5,6,7,8,9]
X_train = [[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7], [4,5,6,7,8]]
Y_train = [[6], [7], [8], [9]]

Now the network is trained on this data, and here the batch_size comes into play. Let's suppose batch_size = 2. Then, before the weights of the neural network are updated, 2 samples are shown to the network, e.g.:

X = [[1,2,3,4,5], [2,3,4,5,6]]
Y = [[6], [7]]

From what I have understood, the following happens in a stateless LSTM. Samples in the same batch (hence X[0] and X[1] in the example above) are processed in parallel. The initial state of the LSTM memory cell is initialized randomly for all the samples (X[0] and X[1]) in the batch, and then each single sample (e.g. X[0]) is processed with BPTT. Hence, for example, the LSTM cell state for the 2nd value of X[0] (which is 2) will use the cell state computed for the 1st value of X[0] (which is 1), and so on, until all time_window_size values in X[0] are processed and the final output is computed. The latter is compared against Y[0], etc. The same happens for X[1]. At the end of the batch, the weights are updated using the gradients computed from the 2 samples.

Am I understanding this correctly? Is it correct to train an LSTM for time-series forecasting in this way? If yes, then why do various papers state that "Because of this ability to learn long term correlations in a sequence, LSTM networks obviate the need for a pre-specified time window and are capable of accurately modelling complex multivariate sequences"? Any help appreciated.
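
For reference, a minimal sketch of the window construction I described above (sliding_windows is just an illustrative helper):

import numpy as np

def sliding_windows(dataset, window):
    # Each X row is `window` consecutive values; Y is the value that follows.
    X = np.array([dataset[i:i + window] for i in range(len(dataset) - window)])
    Y = np.array([dataset[i + window] for i in range(len(dataset) - window)])
    return X[..., None], Y[:, None]   # shapes (samples, window, 1) and (samples, 1)

X_train, Y_train = sliding_windows([1, 2, 3, 4, 5, 6, 7, 8, 9], window=5)
print(X_train[:, :, 0])   # [[1 2 3 4 5] [2 3 4 5 6] [3 4 5 6 7] [4 5 6 7 8]]
print(Y_train[:, 0])      # [6 7 8 9]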

ylmeng commented 6 years ago

I have a simple question. If I train a stateful model and use a loop to perform predictions, such as:

for data in data_set:
    prediction = model.predict(data, batch_size)

Will the model be stateful between the iterations? I mean, when predict() is called, does the model keep the state from the end of the previous prediction?

shrikanth95 commented 6 years ago

@ylmeng I had the same question. See this.

lmxhappy commented 6 years ago

@shrikanth95 That answer is not clear to me and I am not sure what it means. Does it mean the states are updated?

bstriner commented 6 years ago

For a good example of stateful LSTMs (not Keras, though), see this relatively recent paper. It describes a process of using stateful LSTMs to go over the entire WikiText corpus while maintaining hidden state. There are several key points about how you have to generate batches, reset state, vary sequence length, etc.

TL;DR: Backprop does not work between batches. However, if you vary the batch boundaries, and randomly reset the hidden state, you can get reasonable generalization for longer sequences.

https://arxiv.org/pdf/1708.02182.pdf https://github.com/salesforce/awd-lstm-lm
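
Very roughly, and paraphrasing rather than quoting the paper's actual procedure, the idea looks something like this (variable_bptt_chunks and all names here are just illustrative):

import numpy as np

def variable_bptt_chunks(stream, mean_len=70, std=5, reset_prob=0.01, rng=None):
    # Walk over one long token stream with a *variable* BPTT length and an
    # occasional state reset, so sequence boundaries don't always fall in the
    # same places across epochs.
    rng = rng or np.random.default_rng(0)
    pos = 0
    while pos < len(stream) - 1:
        seq_len = max(5, int(rng.normal(mean_len, std)))   # vary the boundary
        x = stream[pos:pos + seq_len]
        y = stream[pos + 1:pos + seq_len + 1]              # next-token targets
        yield x[:len(y)], y, rng.random() < reset_prob     # third value: reset state?
        pos += seq_len

# Usage sketch with a hypothetical stateful `model`:
# for x, y, do_reset in variable_bptt_chunks(token_stream):
#     if do_reset:
#         model.reset_states()
#     model.train_on_batch(x[None, :], y[None, :])   # add a batch dimension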

TristanJM commented 6 years ago

@mik1904 I've had the same question on my mind for hours now! Did you have any luck? @bstriner maybe you could advise? I feel like this could almost be a FAQ

Example:

Dataset = [1,2,3,4,5,6,7,8,9]
X_train = [[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7], [4,5,6,7,8]]
Y_train = [[6], [7], [8], [9]]

Given a time-series dataset (e.g. forex), the goal is to learn patterns in the data to enable future prediction, so that model.predict([20,21,22,23,24]) -> [25].

Objectives:

I'm hoping this will help me and others decide between stateful and stateless LSTMs and save time when implementing window-forecast models. You can tune hyperparameters with an algorithm and assess prediction performance, but it would be beneficial to understand the theory first!

Stateless LSTM with windows:

A batch size of 1 would mean a single window is fed in, and a batch size of 2 would mean two windows are passed in during training before the LSTM weights are updated.

I was of the understanding, e.g. from this article (by @jbrownlee), that a batch size containing all the windows would learn the sequence best:

This suggests that if we had a batch size large enough to hold all input patterns and if all the input patterns were ordered sequentially, that the LSTM could use the context of the sequence within the batch to better learn the sequence.

  1. To take advantage of a larger batch size learning a longer sequence, should the training be set to shuffle = False? Shuffling will randomise the windows inside each batch, therefore losing any long-term sequence found by retaining state across sequential windows. E.g. shuffle=True: [[4,5,6,7,8], [2,3,4,5,6]]; shuffle=False: [[1,2,3,4,5], [2,3,4,5,6]] (better, because the LSTM continues from the first window's state).

  2. Or is the whole point of window training that the network only finds patterns within each independent window rather than across the entire dataset? I.e. if you wanted to find a pattern in a longer sequence, should you instead increase the size of each window? If this is the case, how does a larger batch size help?

Stateful LSTM:

Could the problem be solved better by a stateful LSTM? The state is maintained after each batch and reset after each epoch, meaning the data must be fed sequentially. This would let a model learn the whole dataset.

For this to succeed, the number of windows must be a multiple of the batch size for both training and prediction. E.g. for a batch size of 2:

Train: [[window 1][window 2], [window 3][window 4], ... ]
Predict: [window 10][window 11] -> [x, y]

where x is the prediction of the value after window 10 (the start of window 11), and y is the prediction we care about (into the future).

  1. For a stateful LSTM, should the training data not be supplied as consecutive, non-overlapping windows? As the input is sequential, would the model be expecting [1-5][6-10][11-15]... instead of the current overlapping windows?

Many thanks in advance! 😄

ashu5644 commented 5 years ago

@bstriner I didn't find anything related to stateful LSTMs in the https://arxiv.org/pdf/1708.02182.pdf paper, or why they should be used in a language model. Can you briefly describe some points about it, or provide a relevant link?