keras-team / keras

Deep Learning for humans
http://keras.io/

Understanding the structure of an LSTM network in Keras. Confused Questions #4973

Closed: mpgussert closed this issue 7 years ago

mpgussert commented 7 years ago

Greetings all!

Suppose that I am currently trying to make an agent for a game. After the agent is trained, I would like it to be able to accept a vector encoding the current screen and return a vector describing what actions to take (like a policy network). However, due to the nature of the game, the current screen is NOT the current state of the game. The game state is something that must be built, managed, and remembered by the network internally. I have thus far been working under the assumption that a network with an LSTM layer is the way to achieve this. (Note: I am not actually making a game agent; it just simplifies the description of my problem.)

To summarize: for each time step of play, the network receives information about ONLY time step t and generates some action to take at time step t+1.

From my current understanding this is a "many to one" architecture as described here. Is that correct?

If so, then how do I go about training it? Assume I have a large set of screen -> action values. My questions are...

1) Would a subsequence of my data be considered a batch? Would the input shape to the LSTM units be (1, n_dim), where n_dim is the number of values in my input vector?

2) To make the LSTM units in the layer not return "many" outputs, would I use return_sequences=False?

3) When does the LSTM memory get cleared in training? I see that there is a stateful flag that can be used. What precisely does it do in this context? Is the memory cleared after every batch?

EDIT: A more concise version. Suppose I want to make a network that accepts the value, V, of some time series at time t (only one input) and predicts the value of f(V) at time t+1 (one output). How would I train that model? Here's the example code:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, Activation
from sklearn.preprocessing import MinMaxScaler

pi = 3.14159
f = 0.01  # Hz
omega = 2*pi*f

t = np.arange(10000)
state = np.sin(omega*t)   # the network's input (the "screen")
action = np.cos(omega*t)  # the network's target (the "action")

# MinMaxScaler expects 2-D input, so reshape to a column vector first
StateTrans = MinMaxScaler(feature_range=(0, 1))
scaledState = StateTrans.fit_transform(state.reshape(-1, 1)).ravel()

ActionTrans = MinMaxScaler(feature_range=(0, 1))
scaledAction = ActionTrans.fit_transform(action.reshape(-1, 1)).ravel()

# inputs shaped (samples, timesteps=1, features=1)
xstate = scaledState.reshape(state.shape[0], 1, 1)
# shift the targets by one step so y[t] is the value at t+1
ystate = np.roll(scaledState, -1).reshape(state.shape[0], 1)
yaction = np.roll(scaledAction, -1).reshape(action.shape[0], 1)

def create_model(nIn, nOut):
    model = Sequential()
    model.add(LSTM(10, input_dim=nIn, input_length=1, return_sequences=True))
    model.add(Flatten())
    model.add(Dense(10))
    model.add(Activation('tanh'))
    model.add(Dense(10))
    model.add(Activation('tanh'))
    model.add(Dense(nOut))
    model.add(Activation('tanh'))
    model.compile('adam', loss='mse')
    return model

model = create_model(1, 1)

model.fit(xstate, yaction, nb_epoch=10, batch_size=1, verbose=1)
p = model.predict(xstate)
Ip = ActionTrans.inverse_transform(p)

In this case, the "game" is the time series defined by sine: the network needs to learn to invert the sine function and convert to cosine at t+1. Training seems to flatline at a loss of 0.127 and I have no idea why...

bstriner commented 7 years ago

It all depends on what you're trying to learn. If you have "correct" choices, you can train the LSTM to predict the correct choices based on previous observations. If you only have rewards and are trying to learn a Q function, similarly train the LSTM to predict rewards from observations and actions.

Basically, you will need to do a "rollout": let's say play the game for 100 steps 32 times. Then train the LSTM on those 32 sequences. Then do another rollout.

For your example, try breaking the one sequence of 10,000 samples into 100 sequences of 100 samples.
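(A minimal sketch of that split, not from Ben's comment: it assumes the scaled 1-D arrays from the code above, and a model rebuilt with input_length=100 so it accepts 100-step sequences.)

# 10,000 samples -> 100 sequences of 100 timesteps, 1 feature each
xseq = scaledState.reshape(100, 100, 1)   # (sequences, timesteps, features)
yseq = scaledAction.reshape(100, 100, 1)  # one target per timestep (return_sequences=True)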

Make sure to save some rollout data for validation so you can see if your network is learning.

Cheers, Ben

mpgussert commented 7 years ago

Thank you for the reply!

So, focusing on the example, I don't fully follow what you are saying. Right now my data is a rank-3 tensor of dimensions (samples, time steps, value features), where time steps is 1 and value features is 1. I do this because I need a model that examines only one value, V, and returns f(V).

Is there a way to feed my model 100 sequences of 100 samples without changing the input shape of the model?

I know this model can predict V (simply change yaction to ystate in model.fit), so I don't THINK it's outside the scope of what an LSTM can do, but honestly, I may be completely wrong about that.

bstriner commented 7 years ago

If you only want 1 timestep used for prediction, then you shouldn't be using an LSTM. A simple MLP can predict the next timestep. An LSTM can work, but why would you be using one? What's the point?

Also, if you want your LSTM to learn something, you need many training examples. Each sequence is a training example, so that means you should be generating many subsequences.

The sin/cos example is not Markovian given just one timestep. To predict the next value you need at least two timesteps. For example, if sin(t) is 0, you can't tell me what sin(t+pi/2) is: sin(t)=0 at both t=0 and t=pi, so sin(t+pi/2) could be either 1 or -1. So, you will need to feed sequences of at least length 2 to solve the sin problem.

So, if you want an LSTM to take in sin(t) and output cos(t), it is going to need to use a sequence length of k. Your data should then be (batch_dim, k, 1).

In your example code, you generate a single sequential list of cos and sin, but pass only individual samples to the LSTM.

As a rough example, imagine this generator that generates sequences of 10 samples starting from a random point in [0, 2*pi]. The generated shapes are (1, 10, 1), which Keras will collect into batches of (batchdim, 10, 1).

def mydata():
    while True:
        # random starting phase in [0, 2*pi)
        x = np.random.random()*np.pi*2
        # 10 consecutive integer-spaced timesteps, shaped (1, 10, 1)
        t = np.arange(x, x+10).reshape(1, 10, 1)
        yield (np.sin(t), np.cos(t))
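(A hedged usage sketch, not from the comment: training on this generator with the Keras 1.x fit_generator signature, matching the nb_epoch usage above; the sample counts are arbitrary.)

# assumes a model whose input shape is (timesteps=10, features=1)
model.fit_generator(mydata(), samples_per_epoch=3200, nb_epoch=10)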

Training on this generator should learn to predict cos from sin. The main differences between this and your original code: it draws many sequences starting from random points rather than one fixed series, and each training example is a sequence of 10 timesteps rather than a single sample.

Cheers, Ben

mpgussert commented 7 years ago

I think the matrix is slowly opening itself before me. A couple of things.

First, I'm not just feeding integer values to the LSTM; I'm feeding it something like 100 samples per cycle of sine. Also, as I stated, the model I have there can indeed predict the sine curve on its own just fine.

I think I am seeing what you are saying, though. Keras is doing something under the hood here with the LSTM, I think. If I train using data of (batch_dim, k, 1), will the network require k values to make a prediction after training, or 1?

Once the model is trained I want it to have this behavior: give it a value V, get a predicted f(V). I would like the model to know where it is on the sine curve because of the values I have fed it previously, so, for example, if I feed it sin(0) it knows which cos value to use because the value I fed it before was sin(0.001).

Does that make sense? It's very likely I'm being super dense.

bstriner commented 7 years ago

The example is feeding sin([0, 10000]), which covers many possible values, but it is not as good as writing a generator that can randomly sample from all values.

LSTMs are designed to use past experiences to predict the future. Think about an MDP: the future values are dependent on the last k steps of input. If k>1, you need to use an LSTM with multiple timesteps. If future values are only dependent on the current input, you can use an MLP and you don't need an LSTM.

The real issue is that no one can predict the sin() cos() problem given only the current values. If you set k=10, the best trained network is going to get t=0 wrong and the next 9 values right.

In the sin/cos example you need the last two values to make a prediction. So imagine you aren't using an LSTM. If you just trained a dense MLP to predict cos(t) from sin(t), it could never learn to do it correctly. If you trained a dense MLP to predict cos(t) from sin(t) and sin(t-1), it would be able to predict correctly. So you would have to generate inputs of shape (batchdim, 2, 1) and outputs of shape (batchdim, 1, 1).
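(A minimal sketch of that dense baseline, not from the comment: it predicts cos(t) from the last two sin values, with the input flattened to shape (batchdim, 2).)

from keras.models import Sequential
from keras.layers import Dense

# X: (n, 2) rows of [sin(t-1), sin(t)]; y: (n, 1) values of cos(t)
mlp = Sequential()
mlp.add(Dense(10, input_dim=2, activation='tanh'))
mlp.add(Dense(1, activation='tanh'))
mlp.compile('adam', loss='mse')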

The LSTM example works with arbitrary timescales, so you don't have to know ahead of time how far back you have to look. So while the MLP with the last 2 samples will work in the toy problem, an LSTM is better for broader problems.

If you ran my generator code to generate samples of 10 timesteps, you would see that the network never gets the first timestep correct, but would get the rest of the timesteps correct.

For prediction, you would have to take a sample of the last 10 timesteps and get a prediction of the next timestep. You could of course pad your data and pass only the last 2 timesteps to your model, but in a generic prediction problem you don't know how many steps you need, and the more steps the LSTM has, the better the chance of it having enough.

Basically, an LSTM is only really meaningful with >1 timestep. You use an LSTM because you think your model needs the last k steps of data, not just the last 1 step. Therefore it doesn't make sense to make predictions with just 1 step of data either. You need to make predictions given the last k steps of data.

With return_sequences=True, the LSTM will have k input steps and k output steps.

So in terms of usage, you are best off storing past values in your code. When you want to predict t, pass t, t-1,t-2... to the model as a single sequence of shape (1,k,1). That will give you predictions of shape (1,k,1), and you should take the last prediction.
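(A hedged sketch of that usage, not from the comment: it assumes a model trained on length-k sequences with return_sequences=True, returning predictions of shape (1, k, 1), and that the rolling buffer is already full.)

import numpy as np
from collections import deque

k = 10
history = deque(maxlen=k)  # rolling buffer of the last k observations

def predict_next(model, v_t):
    history.append(v_t)
    seq = np.array(history).reshape(1, len(history), 1)
    # the model returns one output per input timestep; take the last one
    return model.predict(seq)[0, -1, 0]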

mpgussert commented 7 years ago

note: I have zero experience with MLPs or MDPs

I also just realized my toy example is the worst possible example I could have picked. The model works if I look ahead by a small amount, or look ahead such that the target looks like +/- sine (I had a whole response written with more questions...).

The reason WHY it works isn't because it's able to predict the time series; it's because there is only one point in the data where sin(omega*t) is possibly degenerate (exactly zero), and that's the first data point... everywhere else it's only ever close to zero, and that almost-zero value is always different...

This has given me a LOT to think about though, thank you so much! I have some actual toy data to play with instead of my sine curve.

https://datamarket.com/data/set/22u3/international-airline-passengers-monthly-totals-in-thousands-jan-49-dec-60#!ds=22u3&display=line

I'm going to try to make a model to predict log(v) given only v as an input. If I succeed I will post a synopsis here. If not, I will be back and confused again XD

Thank you again!

mpgussert commented 7 years ago

@bstriner Thank you so much! I have determined that my issue is a matter of degeneracy in my time series. Most of the time series I am trying to get the network to predict is a constant value, with intermittent changes triggered by the input time series (like a digital signal that is mostly zeros). I'm currently trying to figure out the best way to approach this problem. Can you help me one last time by recommending some reading? Does this kind of time series / problem have a name that I can google?

bstriner commented 7 years ago

Kind of a fundamental issue with LSTMs. They can theoretically learn long term dependencies but in practice they degrade after some number of timesteps. There are a lot of techniques for helping with long term dependencies like skip connections. Google around long term dependencies in LSTMs or something like that. If you're not married to LSTMs, you can look up ARIMA and other time series models. You could also play with GRUs, stacked LSTMs, BiDirectional LSTMs, etc., depending on what makes sense for your data.

Try an LSTM, see how it works, and go from there. Make sure your sequences are at least long enough that the long term dependency is captured in most random samples but short enough that you can do a ton of sequences in a batch.

de-code commented 7 years ago

Very good explanation.

Hearing that LSTMs can make predictions based on a sequence with long-term dependencies might sound as if the model remembers previous states, when in fact the model usually learns how to make predictions but doesn't store the previously passed-in states between calls. Therefore one feeds the model a series of states.

I thought it might be worth pointing out that Keras does support stateful LSTMs (which might be more in line with the original understanding):
http://philipperemy.github.io/keras-stateful-lstm/
https://github.com/fchollet/keras/blob/master/examples/stateful_lstm.py
https://github.com/fchollet/keras/issues/2328
https://keras.io/layers/recurrent/
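(Not from the comment: a minimal stateful sketch in the Keras 1.x style used elsewhere in this thread. batch_input_shape fixes the batch size, and the LSTM state carries across batches until reset_states() is called; xstate and ystate are the arrays from the original code.)

from keras.models import Sequential
from keras.layers import Dense, LSTM

smodel = Sequential()
smodel.add(LSTM(10, batch_input_shape=(1, 1, 1), stateful=True))
smodel.add(Dense(1))
smodel.compile('adam', loss='mse')

for epoch in range(10):
    # shuffle=False keeps timesteps in order so the carried state is meaningful
    smodel.fit(xstate, ystate, nb_epoch=1, batch_size=1, shuffle=False)
    smodel.reset_states()  # clear the carried state between passes over the series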

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

hanikh commented 7 years ago

@bstriner I am really confused. What is the difference between batch size and time step in updating the weights and backpropagation?

dpshorten commented 6 years ago

I'm bumping this up; I'm experiencing the same confusion as hanikh.