How to directly use LSTM to deal with the video sequences with different segments

keras-team / keras

Deep Learning for humans

http://keras.io/

Apache License 2.0

61.91k stars 19.45k forks source link

How to directly use LSTM to deal with the video sequences with different segments #773

Closed tzczsq closed 9 years ago

tzczsq commented 9 years ago

Hi, I have a problem. I want to use LSTM to classify multi-class video sequences with different segments with fixed dimension. For example, I have two video sequences from 7 classes, Video sequence A has 4 segments, each of which has 600-dimension features, i.e. A=[600_dim,600_dim,600_dim,600_dim] Video Sequence B with 3 segments, i.e., B=[600_dim,600_dim,600_dim] the number of segments represent the time axis. 600-dimensional feaures are float. How to organize the input video sequences as the input of LSTM, such as the shape (nb_samples, timesteps, features)? Is it must be use the embedding layer for this varie-length video sequences?

fchollet commented 9 years ago

How to organize the input video sequences as the input of LSTM, such as the shape (nb_samples, timesteps, features)

We don't currently support sequences of pictures / videos (5D tensors), although we will soon (once we have a recurrent container...). So you couldn't do end to end learning with Keras on a video problem.

But here's what you can do instead:

train a convolutional per-frame classifier, ending up in a fully connected layer (4D input - 2D output). It is trained on determining the probability that a frame belongs to a video in such or such category.
record the output of the fully connected bottleneck layer over all your frames (2D). Create a sequence of the frames for each video (3D).
Train a LSTM classifier over the 3D sequences.

lukedeo commented 9 years ago

If you're not married to the idea of an LSTM (i.e., you're not trying to do next-frame prediction), you could use an experimental temporally distributed 2D convolution. This accepts 5D tensors of shape (num_samples, num_timesteps, stack_size, num_rows, num_cols), so could work.

wangpichao commented 8 years ago

@fchollet Does Keras now support 5D tensor now? Thanks.

davideboschetto commented 5 years ago

Is there any official (or semi official) way to do video classification starting from video labels (and not single frame labels) using a CNN+RNN solution?

My approach usually is:

Input
TimeDistributed(CNN) -> sequence of "activations" for each frame
Recurrent part (LSTM/GRU)
Flatten+Dense

Point is, training "both models" (the Convolutional and the Recurrent parts) at the same time, with the same optimizer and LR is very hard and almost unfeasible (learning convolutional features while learning sequence weights kind of makes a mess of the whole process).

@fchollet @Dref360 @farizrahman4u @gabrieldemarmiesse @taehoonlee