keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

How to directly use LSTM to deal with the video sequences with different segments #773

Closed tzczsq closed 9 years ago

tzczsq commented 9 years ago

Hi, I have a question. I want to use an LSTM to classify video sequences into multiple classes, where each sequence has a different number of segments and every segment has a fixed-dimension feature vector. For example, I have two video sequences drawn from 7 classes. Video sequence A has 4 segments, each with 600-dimensional features, i.e. A=[600_dim,600_dim,600_dim,600_dim]. Video sequence B has 3 segments, i.e. B=[600_dim,600_dim,600_dim]. The number of segments represents the time axis, and the 600-dimensional features are floats. How should I organize these video sequences as input to the LSTM, i.e. in the shape (nb_samples, timesteps, features)? Must I use an Embedding layer for these variable-length video sequences?
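A note on the layout asked about here: an Embedding layer maps integer indices to vectors, so it is not needed when the features are already floats. One common approach is to zero-pad every video to the longest segment count and keep a mask marking the real timesteps. A minimal NumPy sketch, where `pad_videos` and the random toy data are hypothetical names invented for illustration and are not part of Keras:

```python
import numpy as np

def pad_videos(videos, pad_value=0.0):
    """Stack variable-length (timesteps_i, features) arrays into one
    (nb_samples, max_timesteps, features) array plus a boolean mask."""
    nb_samples = len(videos)
    max_timesteps = max(v.shape[0] for v in videos)
    features = videos[0].shape[1]
    batch = np.full((nb_samples, max_timesteps, features), pad_value, dtype=np.float32)
    mask = np.zeros((nb_samples, max_timesteps), dtype=bool)
    for i, v in enumerate(videos):
        t = v.shape[0]
        batch[i, :t] = v    # real segments first, padding stays at the end
        mask[i, :t] = True  # True marks a real (non-padded) timestep
    return batch, mask

# Toy stand-ins for videos A (4 segments) and B (3 segments), 600 features each.
rng = np.random.default_rng(0)
video_a = rng.random((4, 600))
video_b = rng.random((3, 600))

batch, mask = pad_videos([video_a, video_b])
print(batch.shape)       # (2, 4, 600)
print(mask.sum(axis=1))  # [4 3]
```

In Keras, a `Masking(mask_value=0.0)` layer placed in front of the LSTM is the usual way to make the recurrent layer skip the padded timesteps; this assumes no real segment consists entirely of zeros.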

fchollet commented 9 years ago

> How to organize the input video sequences as the input of LSTM, such as the shape (nb_samples, timesteps, features)

We don't currently support sequences of pictures / videos (5D tensors), although we will soon (once we have a recurrent container...). So you couldn't do end to end learning with Keras on a video problem.

But here's what you can do instead:

lukedeo commented 9 years ago

If you're not married to the idea of an LSTM (i.e., you're not trying to do next-frame prediction), you could use an experimental temporally distributed 2D convolution. It accepts 5D tensors of shape (num_samples, num_timesteps, stack_size, num_rows, num_cols), so it could work here.

wangpichao commented 8 years ago

@fchollet Does Keras support 5D tensors now? Thanks.

davideboschetto commented 5 years ago

Is there any official (or semi official) way to do video classification starting from video labels (and not single frame labels) using a CNN+RNN solution?

My approach usually is:

  1. Input
  2. TimeDistributed(CNN) -> sequence of "activations" for each frame
  3. Recurrent part (LSTM/GRU)
  4. Flatten+Dense
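Step 2 of the list above amounts to folding the time axis into the batch axis, running the per-frame model once, and unfolding the result into a sequence for the recurrent part. A minimal NumPy sketch of that reshaping, assuming the channels-first 5D layout mentioned earlier in the thread; `time_distributed` and `toy_frame_encoder` are hypothetical helpers, and the "encoder" is only a stand-in for a real CNN:

```python
import numpy as np

def time_distributed(frame_fn, clips):
    """Apply frame_fn independently to every frame of every clip.

    clips:    (num_samples, num_timesteps, channels, rows, cols)
    frame_fn: maps (batch, channels, rows, cols) -> (batch, feature_dim)
    returns:  (num_samples, num_timesteps, feature_dim)
    """
    n, t = clips.shape[:2]
    frames = clips.reshape((n * t,) + clips.shape[2:])  # fold time into batch
    feats = frame_fn(frames)                            # one pass over all frames
    return feats.reshape(n, t, -1)                      # unfold time again

def toy_frame_encoder(frames):
    # Stand-in for a CNN: global average over rows and cols, per channel.
    return frames.mean(axis=(2, 3))

clips = np.random.default_rng(0).random((2, 5, 3, 8, 8))
seq = time_distributed(toy_frame_encoder, clips)
print(seq.shape)  # (2, 5, 3)
```

The resulting (num_samples, num_timesteps, feature_dim) array has the same layout an LSTM or GRU expects as input in step 3.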

The point is that training "both models" (the convolutional and the recurrent parts) at the same time, with the same optimizer and learning rate, is very hard and close to infeasible: learning convolutional features while also learning sequence weights kind of makes a mess of the whole process.

@fchollet @Dref360 @farizrahman4u @gabrieldemarmiesse @taehoonlee