keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

LSTM causality for time-series prediction #4945

Closed · icelighter closed this issue 7 years ago

icelighter commented 7 years ago

If I have a time-series X with dimensions (steps, features) = (10000, 10) and a corresponding time-series Y with dimensions (steps, states) = (10000, 2), and I feed them into an LSTM layer followed by a TimeDistributedDense layer, do the outputs respect causality? I'm trying to learn the time-series Y by feeding only the time-series X into the network; there is a functional mapping between the two.

As a concrete example, say I break up X into samples of 100 timesteps:

features = 10
timesteps = 100
output = 2

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(timesteps, features)))
model.add(LSTM(50, return_sequences=True))
model.add(TimeDistributed(Dense(output, init='uniform', activation='linear')))
model.compile(loss='mse', optimizer='rmsprop')

For each sample, this network will give output_vector = (100, 2), with the loss calculated against the corresponding timesteps in Y.

Would the RNN be able to 'cheat' by looking ahead in each sample of 100 timesteps and thus infer what the 100th timestep for the output_vector should be, or does the network respect causality, in the sense that each input timestep is processed one at a time and the output_vector is built up sequentially? Put another way, do the LSTM layers have access to all 100 timesteps simultaneously when calculating each layer's output?

patyork commented 7 years ago

A simple, unidirectional RNN (regardless of the neuron types used) by definition 'respects causality'. The output at time t+1 is dependent upon the output at time t, therefore requiring the output at time t to be calculated first.

This is not the case for bidirectional RNNs, which perform output calculations both forwards and backwards through time.

http://www.deeplearningbook.org/contents/rnn.html
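
For contrast, here is a minimal sketch (assuming the Bidirectional wrapper available in recent Keras versions): the first stack only moves forward through time, while the wrapped version also runs a backward pass and therefore sees future timesteps.

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
from keras.layers.wrappers import Bidirectional

features = 10
output = 2

# causal: the output at step t depends only on inputs up to step t
causal = Sequential()
causal.add(LSTM(50, return_sequences=True, input_shape=(None, features)))
causal.add(TimeDistributed(Dense(output)))

# not causal: the backward direction reads the sequence from the end
non_causal = Sequential()
non_causal.add(Bidirectional(LSTM(50, return_sequences=True), input_shape=(None, features)))
non_causal.add(TimeDistributed(Dense(output)))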

icelighter commented 7 years ago

Thanks for the quick response @patyork. I'm not very familiar with bidirectional RNNs. In the context of the example I gave, would that network be considered unidirectional or bidirectional? Also, if the example network is considered unidirectional, then I believe there's a bug, because I have tested a similar network and the performance of the LSTMs cannot be accounted for unless they have privileged information from 'future' timesteps.

Also, I've noticed that there are many tutorials that use Keras LSTMs for time-series prediction, but the training time-series is broken up such that X[i:i+100] is used to predict X[i+100], or some variation thereof. In my case, however, it would have to be X[i:i+100] used to predict Y[i+100]. In the latter scenario, the TimeDistributedDense layer would be replaced by a Dense layer. But if the original LSTM example I gave respects causality, then the performance of the two networks should be similar. However, in the tests I have run, the network with the Dense layer drastically underperforms in terms of MSE loss.

patyork commented 7 years ago

Without using the Bidirectional wrapper or the simple BiRNN layer (which I think exists but isn't documented), it will be a regular recurrent network - so your example would be a standard move-forward-in-time recurrent network. There is no bug, but recurrent networks, especially those with LSTM neurons, can learn to "expect" a future output when they see a certain input; for example, if a '7' in a sequence is always (or very usually) followed by a '9' in the training data, the network may learn to output a '7' and then a '9' whenever it sees a 7 - this is just the network learning a pattern.

I don't fully understand the question here - can you send me an example that shows that behaviour? In a recurrent network, the output at time t is dependent upon all inputs up to and including time t, since some information is passed forward at each time step. That's not to say that at time t the network is handed all of the inputs at once, but that there is a dependency on the previous data.

As a note, the plain Dense layer has now replaced the TimeDistributedDense layer and the TimeDistributed(Dense()) wrapper; Dense is now smart enough to know when it needs a temporal dimension, as of the Keras versions from the last week or two.
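
If you want to verify that on your own install, here is a quick sketch (whether plain Dense keeps the time axis depends on your Keras version; on older versions the second model may raise an error):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model_a = Sequential()
model_a.add(LSTM(50, return_sequences=True, input_shape=(None, 10)))
model_a.add(TimeDistributed(Dense(2)))

model_b = Sequential()
model_b.add(LSTM(50, return_sequences=True, input_shape=(None, 10)))
model_b.add(Dense(2))

# both should report (None, None, 2) if Dense preserves the time dimension
print(model_a.output_shape)
print(model_b.output_shape)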

patyork commented 7 years ago

If you don't trust that the network isn't cheating, you can test on one full sample of, say, 1000 timesteps and save the output. Then you can test on just the first 500 timesteps and get 500 outputs; if the network respects causality, those 500 outputs will exactly match the first 500 outputs from the full run.

You can check for yourself using the below script:

'''Check that a stacked LSTM respects causality: the predictions for the
first 10 timesteps are identical whether the model is fed 10 timesteps
or the full 100.
'''

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
import numpy as np

features = 10
timesteps = 100
output = 2

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(None, features)))
model.add(LSTM(50, return_sequences=True))
model.add(TimeDistributed(Dense(output, init='uniform', activation='linear')))
model.compile(loss='mse', optimizer='rmsprop')

np.random.seed(2017)

sequences = np.random.random_sample((1, timesteps, features))

# predict on just the first 10 timesteps, then on all 100
first10 = model.predict(sequences[:1, :10])
allofthem = model.predict(sequences[:1, :])

assert first10.shape == (1, 10, 2)
assert allofthem.shape == (1, 100, 2)

# the first 10 outputs are unaffected by the presence of the later 90 inputs
assert np.allclose(first10, allofthem[:, :10, :])

print(first10)
print(allofthem[:, :10, :])
print(first10 - allofthem[:, :10, :])
linxihui commented 7 years ago

First, you should forget about Bidirectional for time series and causality. A unidirectional RNN is the right choice for time series: when it predicts y_t, it only looks backward in time.

If you stick with length=100 (i.e., the longest possible dependency is 100 steps back), then depending on how you segment the time series into samples, each y_t may appear multiple times, since you return a whole sequence of Y. Why not use x[t:t+100] and return only y[t+100] for each window/sample? i.e.,

features = 10
timesteps = 100
output = 2
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(None, features)))
model.add(LSTM(50, return_sequences=False))
model.add(Dense(output, init='uniform', activation='linear'))
model.compile(loss='mse', optimizer='rmsprop')

If you don't care about the maximum length of the dependency, you could probably try a stateful RNN.
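
A minimal sketch of what that could look like, assuming X and Y are the full (10000, 10) and (10000, 2) arrays from the original question (the one-timestep-at-a-time loop is just for illustration):

model = Sequential()
model.add(LSTM(50, batch_input_shape=(1, 1, features), stateful=True))
model.add(Dense(output, init='uniform', activation='linear'))
model.compile(loss='mse', optimizer='rmsprop')

for epoch in range(10):
    # feed one timestep at a time; the LSTM state carries across calls
    for t in range(len(X)):
        model.train_on_batch(X[t].reshape(1, 1, features), Y[t].reshape(1, output))
    # reset the state at the end of each pass over the series
    model.reset_states()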

icelighter commented 7 years ago

@patyork I went through your example script and it's a convincing argument. Though would it make a difference if you had specified input_shape=(timesteps, features)? In the example you gave it seems there is no BPTT, since no number of timesteps is specified? Also, going back to my previous comment, if I make a small modification to your script:

features = 10
timesteps = 100
output = 2
timeseries_length = 1000

np.random.seed(2017)
X = np.random.random_sample((timeseries_length, features))
Y = np.random.random_sample((timeseries_length, output))

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(timesteps, features)))
model.add(LSTM(50, return_sequences=False))
model.add(Dense(output, init='uniform', activation='linear'))
model.compile(loss='mse', optimizer='rmsprop')

Then I use a moving window on X so that the input sequences become X[:100], X[1:101], X[2:102], etc., and the targets become Y[99], Y[100], Y[101], etc., i.e. the first 99 points of Y are discarded.

Should I then expect the performance of the above network to be roughly equivalent to the example you gave? @linxihui I think this is also what you are referring to.
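
A minimal sketch of that windowing, reusing the X, Y, model and timesteps defined just above (the helper name make_windows is only illustrative, not part of Keras):

def make_windows(X, Y, window=100):
    # inputs[i] = X[i:i+window]; targets[i] = Y[i+window-1], aligned with the last input timestep
    inputs = np.array([X[t:t + window] for t in range(len(X) - window + 1)])
    targets = Y[window - 1:]
    return inputs, targets

X_windows, Y_targets = make_windows(X, Y, window=timesteps)
model.fit(X_windows, Y_targets, nb_epoch=10, batch_size=32)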

patyork commented 7 years ago

The (None, features) shape provides a quick way to allow variable-length inputs (meaning I can pass 10 timesteps or 10000); you can think of the None as meaning "any". I'm pretty sure BPTT would still apply above. Specifying the number of timesteps is not strictly necessary for you to use a rolling window; rather, you can do that windowing and just pass 100 items at a time. This would also let you quickly switch to, say, passing 200-timestep inputs without recompiling the model.
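
For example, a sketch of a model built with input_shape=(None, features): the same compiled model accepts windows of different lengths without recompiling.

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
import numpy as np

features = 10

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(None, features)))
model.add(TimeDistributed(Dense(2)))
model.compile(loss='mse', optimizer='rmsprop')

# only the feature dimension is fixed; the time dimension can vary per call
out100 = model.predict(np.random.random_sample((1, 100, features)))
out200 = model.predict(np.random.random_sample((1, 200, features)))
print(out100.shape)  # (1, 100, 2)
print(out200.shape)  # (1, 200, 2)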

That said, doing the windowing to 100 timesteps will still maintain the causality of the network. However, as you've described, the network will have an enforced "look back" of 100 timesteps versus a look-back to the beginning of time (or, the beginning of the sequence if you'd rather) that the network would have without the windowing.

And to answer the question: yes, doing the windowing and discarding the first 99 targets would be equivalent. Interestingly, depending on what you are trying to learn, you may be able to keep all of the Y targets and start by feeding increasing-length sequences. But this would only apply if you don't have a good reason to require 100-timestep sequences.
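
A rough sketch of that increasing-length idea, assuming a return_sequences model built with input_shape=(None, features) plus TimeDistributed(Dense(output)), and the full X, Y arrays (training one growing prefix at a time is just for illustration; in practice you would batch or stride):

# feed prefixes of growing length; each prefix keeps all of its Y targets
for k in range(1, len(X) + 1):
    model.train_on_batch(X[:k].reshape(1, k, features), Y[:k].reshape(1, k, output))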

solenbanson commented 7 years ago

@icelighter Your model looks like a judging machine; if the output y were one-hot, the results might be better. Since Keras currently has no BPTT support, using a stateful RNN and defining the loss and everything else yourself may be the way to go.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.