Trying to create a Many-to-one LSTM

dberma15 commented 7 years ago

Hi All,

I'm fairly new to Keras and I'm looking to create a CNN that feeds into an LSTM for video classification. Essentially, I have a whole stream of images from a video. Each image is 512x512 pixels. Each video sequence is assigned a single class, so I do not have a label for each individual image.

Here is what I have so far:

model = Sequential()
model.add(Convolution2D(3, 3, 3, input_shape=(1, 512, 512), activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(3, 3, 3, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(6, 1, 1, activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(6, 1, 1, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(12, 1, 1, activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(12, 1, 1, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Reshape((1,49152)))
model.add(LSTM(128))
model.add((Dense(1)))
model.add(Activation("sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Earlier I populate the variable X_train with the images, using a for loop that grabs all the images from each sequence, so they're stored in this variable:

X_train[i][0,:,:]=np.array(img)

I then store X_train and y_train as:

X_train=np.array(X_train)
y_train=np.array(y_train)

However, when I try to train the model using:

model.train_on_batch(X_train, y_train)

I get the following error:

ValueError: input arrays should have the same number of samples as target arrays. Found 195 input samples and 1 target samples.

What am I doing wrong?


unrealwill commented 7 years ago

Hello. What is the output of the following, just before train_on_batch?

print(X_train.shape)
print(y_train.shape)

You are probably getting this error because you didn't give one label per image.
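To make the error concrete, here is a minimal sketch of the kind of check that fails. It is not Keras code: `leading_dims_match` is a hypothetical helper, and the shapes are assumed from the error message (195 frames, one label).

```python
def leading_dims_match(x_shape, y_shape):
    """Return True when inputs and targets have the same number of samples.

    The first axis is treated as the sample axis, which is what the
    'same number of samples' error message refers to.
    """
    return x_shape[0] == y_shape[0]


# Shapes assumed from the error message: 195 frames, a single label.
frames_shape = (195, 1, 512, 512)
label_shape = (1,)
print(leading_dims_match(frames_shape, label_shape))

# If the whole video is treated as ONE sample, the frames move to a
# time axis and a leading batch axis of size 1 is added.
video_shape = (1,) + frames_shape
print(video_shape, leading_dims_match(video_shape, label_shape))
```

The first call prints False (195 samples against 1 label); the second shape has one sample for one label, which is the layout a many-to-one model would expect.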

Currently your network is not a video classifier but an image classifier. One option is to give the training label of the whole video sequence to each individual image. For example, if you are trying to determine "Porn or not", you can usually make a prediction for each individual image and then combine the predictions later. Sometimes an individual image is not a strong indicator, but that doesn't really matter (as long as you take care of class imbalance to remain unbiased), because it will just be seen as additional noise.
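A rough sketch of that per-image option, in plain Python. Both helper names are hypothetical, and averaging the frame probabilities is only one assumed way of "combining them later"; the thread does not say which rule to use.

```python
def per_frame_labels(video_label, n_frames):
    """Repeat the video-level label once for every frame of the video."""
    return [video_label] * n_frames


def combine_frame_predictions(frame_probs, threshold=0.5):
    """Average per-frame probabilities into one video-level decision.

    Averaging is an assumption made for this sketch; a max or a vote
    would be other ways to combine the frames.
    """
    mean_prob = sum(frame_probs) / len(frame_probs)
    return mean_prob, int(mean_prob >= threshold)


# Training side: 195 frames all inherit the label of their video.
labels = per_frame_labels(1, 195)
print(len(labels), labels[:3])

# Prediction side: made-up per-frame probabilities for one video.
mean_prob, video_class = combine_frame_predictions([0.9, 0.2, 0.7, 0.8])
print(mean_prob, video_class)
```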

The other option is a full-video approach. It usually consumes too much memory, and it is not practical to load the full video at once, so you use an LSTM with stateful=True and feed it image by image. (Have a look at "Note on using statefulness in RNNs" in the documentation: https://keras.io/layers/recurrent/#lstm)

Here too, at each time step you should give the label for the whole video sequence.

dberma15 commented 7 years ago

print(X_train.shape) returns: (timesteps,1,512,512)

print(Y_train.shape) returns: (1,)

where timesteps are the number of images from the video, and 512 are the dimensions of the image.

It's not a video classifier, per se. I have individual images from a video that need to be classified, and I cannot go through each image to determine whether it's class 1 or class 0 because that requires highly specialized knowledge that I do not have. Therefore, I have to make a classification on the sequence of images, for which I do have the class.

I'm having a bit of trouble with applying stateful. Here's the updated code:

model = Sequential()
model.add(Convolution2D(3, 3, 3, input_shape=(1, 512, 512), activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(3, 3, 3, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(6, 1, 1, activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(6, 1, 1, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(12, 1, 1, activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(12, 1, 1, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Reshape((1,49152)))
model.add(LSTM(128, batch_input_shape=(1, 1, 1),return_sequences=False,stateful=True))
model.add((Dense(1)))
model.add(Activation("sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

I then loop through the pictures stored in each directory:

for direc in solutions['id']:
    X_train=[]
    files=[f for f in os.listdir(direc) if (os.path.isfile(os.path.join(direc,f)) and ".dcm" in f.lower())]
    y_train=solutions['cancer'][solutions['id']==direc]
    X_train=np.zeros(1,512,512)
    for file,i in zip(files,range(0,len(files))):
        ds=dicom.read_file(os.path.join(direc,file))
        img=ds.pixel_array
        X_train[0,:,:]=np.array(img)
        X_train=np.array(X_train)
        y_train=np.array([y_train])
        print('going to train')
        model.train_on_batch(X_train, y_train)
    model.reset_states()

But I get the error:

ValueError: If a RNN is stateful, a complete input_shape must be provided (including batch size).

I did include that, though. So I'm confused.

unrealwill commented 7 years ago

You are on the right track. batch_input_shape must be specified on the first layer of the model. The shapes of the remaining layers are then inferred automatically, so a batch_input_shape given on a later layer is ignored. (By the way, your batch_input_shape was wrong: for the LSTM it should have been (1,1,49152), which is a very big number.)
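For reference, the 49152 can be reproduced by hand. The sketch below assumes what the posted model shows: every convolution uses border_mode='same' (so height and width are unchanged), there are three 2x2 max-pooling layers, and the last convolution has 12 filters. `flattened_size` is a hypothetical helper, not a Keras function.

```python
def flattened_size(height, width, n_filters_last, n_pools, pool=2):
    """Number of features left after Flatten().

    Assumes 'same' convolutions (spatial size unchanged) and that each
    pooling layer divides height and width by `pool`, rounding down.
    """
    for _ in range(n_pools):
        height //= pool
        width //= pool
    return n_filters_last * height * width


# 512 -> 256 -> 128 -> 64 after three poolings, then 12 * 64 * 64.
print(flattened_size(512, 512, n_filters_last=12, n_pools=3))  # 49152
```

With one image per time step, this is the feature dimension the LSTM sees, which is why its input shape is (1, 1, 49152).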

dberma15 commented 7 years ago

So you're saying it should look like this:

model.add(Convolution2D(3, 3, 3, input_shape=(1, 512, 512),batch_input_shape=(1,1,49152), activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(3, 3, 3, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(6, 1, 1, activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(6, 1, 1, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(12, 1, 1, activation='relu', border_mode='same'))
model.add(Dropout(0.2))
model.add(Convolution2D(12, 1, 1, activation='relu', border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Reshape((1,49152)))
model.add(LSTM(128,return_sequences=False,stateful=True))
model.add((Dense(1)))
model.add(Activation("sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

unrealwill commented 7 years ago

Nope, sorry if I explained it badly. The first layer should be:

model.add(Convolution2D(3, 3, 3, batch_input_shape=(1,1,512,512), activation='relu'))

The LSTM layer is correct.

Once your model is written, you should begin debugging by calling model.predict( [input] ,batch_size=your_batch_size_here_1 ) with an input of the correct batch size. Hopefully it will then make a prediction, which will have a shape.

y_train should have the same shape as the prediction you made, so you can usually use np.ones( your_target_shape ) to debug fit.

Then you should feed it your data.

unrealwill commented 7 years ago

An alternative architecture is one that processes the whole sequence of images at once. It needs more memory, but it is truly a many-to-one solution, with cleaner back-propagation of the gradients through time.

model = Sequential()
model.add(TimeDistributed(Convolution2D(...)))
model.add(TimeDistributed(MaxPooling2D(...)))
model.add(TimeDistributed(Convolution2D(...)))
model.add(TimeDistributed(Flatten()))
model.add(LSTM(..., return_sequences=False, stateful=False))
model.add(Dense(1))
model.add(Activation("sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

This model takes as input a 5D tensor of shape (batch_shape, timesequence, nb_row, nb_col, nb_channel) and outputs a 2D tensor of shape (batch_shape, 1).
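As a small illustration of that 5D layout, the sketch below groups a flat, ordered list of frames into videos and reports the resulting batch shape. The helper names, the integer frame placeholders and the (512, 512, 1) frame shape are assumptions for the example; in practice each placeholder would be an image array.

```python
def group_frames_into_videos(frames, frames_per_video):
    """Split a flat list of frames (ordered video by video) into sequences."""
    if len(frames) % frames_per_video != 0:
        raise ValueError("frame count is not a multiple of frames_per_video")
    return [frames[i:i + frames_per_video]
            for i in range(0, len(frames), frames_per_video)]


def batch_shape(videos, frame_shape):
    """Shape of the 5D input: (videos, frames per video) + frame shape."""
    return (len(videos), len(videos[0])) + tuple(frame_shape)


# 12 placeholder frames, 4 per video, so 3 videos.
frames = list(range(12))
videos = group_frames_into_videos(frames, frames_per_video=4)
print(videos)
print(batch_shape(videos, frame_shape=(512, 512, 1)))  # (3, 4, 512, 512, 1)

# One label per video, matching the (batch_shape, 1) output.
target_shape = (len(videos), 1)
print(target_shape)  # (3, 1)
```

The first axis counts videos, not frames, so the targets need one row per video.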

It can probably be written more nicely by introducing a submodel:

imageEmb = Sequential()
imageEmb.add(Convolution2D(...))
imageEmb.add(MaxPooling2D(...))
imageEmb.add(Convolution2D(...))
imageEmb.add(MaxPooling2D(...))
imageEmb.add(Flatten())

model = Sequential()
model.add(TimeDistributed(imageEmb))
model.add(LSTM(..., return_sequences=False, stateful=False))
model.add(Dense(1))
model.add(Activation("sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

dberma15 commented 7 years ago

Here is the loop I'm using to train my model:

for direc in solutions['id']:
    print(np.where(solutions['id']==direc)[0])
    X_train=[]
    files=[f for f in os.listdir(direc) if (os.path.isfile(os.path.join(direc,f)) and ".dcm" in f.lower())]
    y_train=solutions['cancer'][solutions['id']==direc]
    y_train=np.array([y_train.iloc[0]])
    pbar = progressbar.ProgressBar(maxval=1).start()
    for file,i in zip(files,range(0,len(files))):
        ds=dicom.read_file(os.path.join(direc,file))
        img=ds.pixel_array
        X_train=np.zeros((1,512,512))
        X_train[0,:,:]=np.array(img)
        X_train=np.array([X_train])
        model.train_on_batch(X_train, y_train)
        pbar.update(i/len(files))
    pbar.finish()
    model.reset_states()

For some reason all the predictions are the same value. Did I do something conceptually wrong? I have not used that much data to train it because I just wanted to test it out.

unrealwill commented 7 years ago

Looking at your code, nothing wrong jumps out at me. You can display some additional info to monitor the training process:

tr_loss, tr_acc = model.train_on_batch(...)  # tr_acc is only returned if you added the accuracy metric
print(tr_loss, tr_acc)

You can also display y_train to check that it isn't always the same value.

You can try removing the Dropout layers to make the network deterministic. If you do not have much data, you should be able to overfit your training data easily (and reach tr_acc=100%). Only then do you need to worry about overfitting.

dberma15 commented 7 years ago

I've removed the dropouts and I tried testing on some of the training data. Here's what I get after training on the first 300 data points:

i=0 Actual: 1 Prediction: 0.96294081
i=1 Actual: 0 Prediction: 0.96294081

I realize that 300 might be a small number of examples to train on, but I find it odd that it's giving the exact same value as the output. Am I wrong in thinking this is weird or am I just not using enough data?

unrealwill commented 7 years ago

Yep, this is weird; you probably have a bug. Try printing the input you pass to predict. Try simplifying your network to the bare minimum. Try displaying the losses during training. Find the bug, then work your way back up.

dberma15 commented 7 years ago

Based on the code I sent you, it's not that it's only saving the latest iteration, correct?

So I'm trying to identify cars in a video. In the videos where there is a car, parts of the video do not have a car. Could that be a source of the problem? There's overlap between data points with a car and data points without?

unrealwill commented 7 years ago

Having exactly the same prediction (to the last decimal) for two different inputs usually means there is a not-so-subtle bug, and you should probably find it before looking for more subtle issues like the one I describe below.

The more subtle issue is the (temporal) credit assignment problem (from reinforcement learning). With stateful=True and a sequence length of 1, it will be very hard for the network to solve by itself.

The point of having an LSTM or GRU is to remember that it has seen cars in previous frames, so that it can output that there is a car in the whole video. Basically, to make a prediction the model takes as input:

a state, which hopefully represents at least the info "the probability that there was a car in at least one of the previous frames",
the current image.

It then tries to predict the probability that a car is present in the whole video (even if no car has appeared so far), and a new state to forward to its future self. This will be a noisy estimate (noisier for the first frames of the video than for the last ones), which you can compensate for by giving more weight to the updates from the later frames of the video using sample_weight.
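One possible way to build such weights is sketched below. The linear ramp, the `min_weight` floor and the function name are assumptions made for illustration; the comment above only says that later frames should count more. Each weight would then be passed as the sample_weight for the corresponding frame's train_on_batch call.

```python
def frame_weights(n_frames, min_weight=0.1):
    """Linearly increasing weights, from min_weight up to 1.0.

    Early frames (where the state has seen little of the video) get a
    small weight, the final frame gets roughly 1.0. The linear shape is
    an arbitrary choice for this sketch.
    """
    if n_frames == 1:
        return [1.0]
    step = (1.0 - min_weight) / (n_frames - 1)
    return [min_weight + step * t for t in range(n_frames)]


weights = frame_weights(5)
for t, w in enumerate(weights):
    # In the training loop this value would be wrapped in an array of
    # length batch_size (here 1) and given as sample_weight.
    print(t, round(w, 3))
```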

Because you are using stateful=True with a sequence length of 1, the model will have trouble learning to transfer the hidden state efficiently (because the gradient of the states can't flow back through time). But you should at least be able to make a prediction based on the current image, and eventually the model weights will randomly wander and find a way to transfer a single bit of information through time ("car present until now").

The alternative version I suggested a few comments earlier will learn the state transfer through time more efficiently. It is usually the easiest one, but it consumes more memory.

There are other routes you can take. The easy one is to have your individual images labelled (either manually or semi-automatically via semi-supervised or unsupervised learning); then you just have to "filter" (as with a particle filter or Kalman filter) based on the individual image predictions. The hard one is reinforcement learning.

dberma15 commented 7 years ago

I think I may have discovered the problem. While the values it returns are not exactly the same, they are incredibly close. The reason appears to be that it seems to use the last training value as the baseline for the answer. So if the last training example was class 0, it will return something close to 0. If the last training example was class 1, it will return something close to 1.

unrealwill commented 7 years ago

This could be the case if your video sequences are too long. You are streaming one example at a time. Use a batch_size bigger than 1, so that you show it some positive and some negative samples at the same time. Alternatively, you could use SGD instead of adam with a smaller step size, so that you don't overfit to the last seen examples.
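A minimal sketch of building such mixed batches, in plain Python. The sample ids are made up, `balanced_batches` is a hypothetical helper, and an exact half/half split is an assumption; the suggestion above only asks for both classes to appear in each batch.

```python
import random


def balanced_batches(positive_ids, negative_ids, batch_size, seed=0):
    """Build batches that each hold half positive and half negative samples.

    Returns a list of batches; each batch is a list of (sample_id, label).
    Leftover samples that do not fill a complete batch are dropped.
    """
    if batch_size < 2 or batch_size % 2:
        raise ValueError("batch_size must be an even number >= 2")
    rng = random.Random(seed)
    pos, neg = list(positive_ids), list(negative_ids)
    rng.shuffle(pos)
    rng.shuffle(neg)
    half = batch_size // 2
    n_batches = min(len(pos), len(neg)) // half
    batches = []
    for b in range(n_batches):
        chunk = [(i, 1) for i in pos[b * half:(b + 1) * half]]
        chunk += [(i, 0) for i in neg[b * half:(b + 1) * half]]
        rng.shuffle(chunk)  # avoid a fixed positive-then-negative order
        batches.append(chunk)
    return batches


batches = balanced_batches(["p0", "p1", "p2", "p3"],
                           ["n0", "n1", "n2", "n3", "n4", "n5"],
                           batch_size=4)
for batch in batches:
    print(batch)
```

With a stateful LSTM, each position in the batch would have to follow its own video from frame to frame, so the ids here would stand for whole videos rather than single frames.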

dberma15 commented 7 years ago

I have tried reducing the learning rate and switching to SGD to no avail.

dberma15 commented 7 years ago

I've realized I'm having this problem with every model I've created.

I wanted to try to build a two level classifier out of the one here: http://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/ but I run into the same problem: no matter how many epochs I use, it gives the same value for the answer every single time.

import numpy
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.constraints import maxnorm
from keras.optimizers import SGD
from keras.layers.convolutional import Convolution2D
from keras.layers.convolutional import MaxPooling2D
from keras.utils import np_utils
from keras import backend as K
K.set_image_dim_ordering('th')
import tensorflow
from keras.datasets import cifar10
from matplotlib import pyplot
from scipy.misc import toimage
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

#finds everything that's either in the first or second class that appears. I know that they're different, I made sure. One is class 6 and one is class 9 but I change them to be class 0 and class 1. 
print(y_train[0])
print(y_train[1])
is0=y_train==y_train[0]
is1=y_train==y_train[1]

y_train0=y_train[is0]
y_train1=y_train[is1]

class0=[0 for y in y_train0]
class1=[1 for y in y_train1]

X_train0=X_train[numpy.where(is0)[0]]
X_train1=X_train[numpy.where(is1)[0]]

y_training=numpy.concatenate((class0,class1),axis=0)
y_trianing =np_utils.to_categorical(y_training)
X_training=numpy.concatenate((X_train0,X_train1),axis=0)
num_classes=1
X_training = X_training.astype('float32')
y_training2=numpy.zeros((y_training.shape[0],1))
y_training2[:,0]=y_training
X_training = X_training / 255.0

model = Sequential()
model.add(Convolution2D(32, 3, 3, input_shape=(3, 32, 32), border_mode='same', activation='relu', W_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(Convolution2D(32, 3, 3, activation='relu', border_mode='same', W_constraint=maxnorm(3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu', W_constraint=maxnorm(3)))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

# Compile model
epochs = 2
lrate = 0.01
decay = lrate/epochs
sgd = SGD(lr=lrate, momentum=0.9, decay=decay, nesterov=False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

print(y_train.shape)
print(y_training2.shape)

#train the model. You can either use batch or full.
model.fit(X_training, y_training2, nb_epoch=epochs,batch_size=32)
#for j in range(0,epochs):
#    for i in range(0,y_training.shape[0]):
#        model.train_on_batch(numpy.array([X_training[i]]), numpy.array([y_training2[i]]))

predict0=model.predict_on_batch(numpy.array([X_train[0]]))

predict1=model.predict_on_batch(numpy.array([X_train[1]]))

print(y_training2[0])
print(predict0)

print(y_training2[1])
print(predict1)
scores = model.evaluate(X_training, y_training, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

unrealwill commented 7 years ago

If you use softmax as the last activation, use categorical_crossentropy or sparse_categorical_crossentropy. If you use sigmoid as the last activation, use binary_crossentropy.
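The posted model ends in Dense(num_classes, activation='softmax') with num_classes=1, and that alone is enough to produce identical predictions. Softmax normalises across the output units, so with a single unit the result is always 1.0. The plain-Python sketch below (not Keras code) shows the difference from sigmoid:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function, suitable for a single binary output unit."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-3.0, 0.0, 5.0):
    # A single-unit softmax ignores its input; sigmoid does not.
    print(z, softmax([z]), sigmoid(z))
```

Every softmax([z]) is [1.0] whatever the logit, while sigmoid(z) varies with z. For a single output unit, sigmoid with binary_crossentropy is the consistent pairing; softmax needs at least two units.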

You didn't copy and paste properly.

There are probably other mistakes.

Try to run some working code you find somewhere, then try to rewrite it as a simpler model to replicate it. This way you can learn to debug models by yourself, and then you can build your way up. Try to build your model in multiple steps, and find ways to verify your network as you build it, so that you can check you have not introduced a bug. Try to reach the smallest running model that you are 100% confident is working properly, and then, step by step, mutate it into the model that accomplishes what you want, checking its validity along the way.

It is quite hard to debug other people's code, because mistakes can be of any type and anywhere. It is also not very useful for any of us, because it doesn't teach you to build your own networks that you can trust.

dberma15 commented 7 years ago

I looked into more detail of the first model I've been working with and the weights do not seem to change, or if they do, it's almost negligible

forthCNN.txt


dbsousa01 commented 6 years ago

Sorry to bring this up again @unrealwill, but I am having a problem with how to present the training data. I also want to analyse video. Let's say I have 100 videos with 5 frames each, so 500 frames in total. How do I build the training data so I can feed a 5D tensor to my neural network? I suppose that the input shape should be (nb of frames, nb of sequence, rows, cols, channels), where nb of frames is 500 (?) and nb of sequence is between 1 and 5 depending on the order of the frame in each video. Am I thinking correctly? Thank you

keras-team / keras

Trying to create a Many-to-one LSTM #5338