mostafa-saad opened this issue 9 years ago
https://github.com/jeffdonahue/caffe/tree/recurrent would be helpful.
I think it's not different from the example I provided. You can define an LSTM layer on top of the K images in the prototxt. As @nakosung mentioned, Jeff Donahue has another LSTM implementation (which seems likely to be merged into the master branch soon) with examples on images. You can find the prototxt files in his branch. Thanks.
@junhyukoh Thanks so much. One more question: what are the inputs to your LSTM layer? In your example, it takes two elements, data and clip?
@mostafa-saad data is the input to the LSTM. clip is a binary indicator of the continuity of the data (sequence). For example, you can give different input sequences (e.g., [1 2 3 4] and [1 2 3]) as one input as follows:

data = [1 2 3 4 1 2 3], clip = [0 1 1 1 0 1 1]

A "0" marks the head of a sequence. By default, clip is [0 1 1 1 ... 1], which assumes that a single sequence, starting from its head, is given as input. You can also split a very long or variable-length sequence over several forward passes. For example, data = [1 2 3 4 5] can be divided into 5 forward passes:

1. data = [1], clip = [0]
2. data = [2], clip = [1]
3. data = [3], clip = [1]
4. data = [4], clip = [1]
5. data = [5], clip = [1]

Although this seems very inefficient, it is actually necessary, especially when a prediction is used as the input for the next time step (e.g., text modelling).
In your case, I guess you don't have to use "clip", because the input sequence is always complete (starting from its head and continuous). So the default clip values should work for you.
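For concreteness, here is a minimal sketch (not code from the repo) of feeding one long sequence through single-step forward passes, assuming a net whose input blobs are named "data" and "clip" as in lstm_sequence.cpp:

```cpp
#include <vector>
#include "caffe/net.hpp"

// Feed one sequence, one time step per forward pass. Setting clip = 0
// only at the first step tells the LSTM to reset its hidden state there
// and to carry the state forward on every later step.
void FeedSequence(caffe::Net<float>& net, const std::vector<float>& seq) {
  for (size_t t = 0; t < seq.size(); ++t) {
    net.blob_by_name("data")->mutable_cpu_data()[0] = seq[t];
    net.blob_by_name("clip")->mutable_cpu_data()[0] = (t > 0) ? 1.0f : 0.0f;
    net.ForwardPrefilled();  // one time step per pass
  }
}
```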
Thanks. Just to make sure I understand you: suppose I extend the AlexNet architecture with one LSTM layer, and I have 3 training videos, one with 4 frames, another with 3 frames, and a third with 5 frames. Should clip then be: clip = [0 1 1 1 0 1 1 0 1 1 1 1]?
How can I use LevelDB to feed the clip input from the hard disk rather than from memory? Is it possible to just provide a text file?
I am just a novice with Caffe and still learning, sorry for the many questions.
That's correct.
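For illustration, a small hypothetical helper (not from the repo) that builds such a clip vector from per-video frame counts; for videos of 4, 3, and 5 frames it yields [0 1 1 1 0 1 1 0 1 1 1 1], matching the thread above:

```cpp
#include <vector>

// Build the clip indicator for a batch of concatenated videos:
// 0 at the head of each video's sequence, 1 elsewhere.
std::vector<float> MakeClip(const std::vector<int>& frame_counts) {
  std::vector<float> clip;
  for (size_t v = 0; v < frame_counts.size(); ++v) {
    clip.push_back(0.0f);  // head of a new video resets the LSTM state
    for (int t = 1; t < frame_counts[v]; ++t) clip.push_back(1.0f);
  }
  return clip;
}
```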
The current data_layer implementation (src/caffe/layers/data_layer.cpp) doesn't support clip. So you may have to implement your own data layer whose outputs are data/clip if you want to use LevelDB. Another way is to feed data/clip directly from your own program, as in my example code (lstm_sequence.cpp), without using LevelDB; but this doesn't run on a separate thread, so it might be slower than implementing a new data layer.
What about an ImageData input layer with <image, label> pairs, where the images are dummies and the labels carry the binary clip input? Do you think this would work?
I think it would work if you construct the pairs correctly.
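For illustration, a hypothetical list file for such a dummy ImageData layer, where the integer label column carries the clip bit (0 at the head of each video). The file name dummy.jpg is made up here; Caffe's ImageData layer reads lines of "path label":

```
dummy.jpg 0
dummy.jpg 1
dummy.jpg 1
dummy.jpg 0
dummy.jpg 1
```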
Excuse me if this is a very simple question, but I am just starting to learn neural networks with Caffe.
Is it possible to use this network to train on a continuous sequence of 2 variables, e.g., [(2.77, 9.03), (2.01, 10.48), ...], and then predict the next element for a supplied input? For training I could have the sequence [[t0] ... [t9]] (10 time steps) as input and [t10] as the expected output, and then do the prediction in the same manner.
@mecp Yes. It's possible to train the network on multi-dimensional input/output.
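A rough sketch of what one such training example could look like, assuming input blobs named "data" (10 time steps x 2 channels) and "label" (1 x 2) feeding a regression loss such as EuclideanLoss; the actual names and shapes depend on your prototxt:

```cpp
#include <utility>
#include <vector>
#include "caffe/net.hpp"

// Fill one training example: steps t0..t9 as input, t10 as the target.
// series must contain at least 11 (x, y) pairs.
void FillExample(caffe::Net<float>& net,
                 const std::vector<std::pair<float, float> >& series) {
  float* data = net.blob_by_name("data")->mutable_cpu_data();    // 10 x 2
  float* label = net.blob_by_name("label")->mutable_cpu_data();  // 1 x 2
  for (int t = 0; t < 10; ++t) {
    data[2 * t] = series[t].first;       // first variable at step t
    data[2 * t + 1] = series[t].second;  // second variable at step t
  }
  label[0] = series[10].first;   // t10 is the regression target
  label[1] = series[10].second;
}
```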
@junhyukoh what's the difference between the batch size N and the sequence length T?
@HaiboShi In RNN training, a training example is a sequence x_1, x_2, ..., x_T. We can define N such sequences as one mini-batch.
@junhyukoh And the diffs of that mini-batch are summed together to update the weights?
@HaiboShi Yes, the diff is accumulated over the mini-batch. However, loss layers usually give normalized diffs to the bottom blobs (dividing by the size of the mini-batch). So the weight diff is effectively normalized by the mini-batch size.
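In symbols, my reading of the above: with mini-batch size N and per-example losses l_n,

```latex
L = \frac{1}{N} \sum_{n=1}^{N} \ell_n
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial W} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \ell_n}{\partial W}
```

so summing the per-example diffs and normalizing by N are the same thing.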
@junhyukoh Thanks, that helps a lot. Another specific question: in the LSTM layer class, what is the Blob h_to_h_ used for?
@junhyukoh Also, it seems that there is no top diff data in the backward_cpu() function; I wonder how the gradient from the layer above passes into the LSTM layer? Thanks! :100:
@HaiboShi h_to_h_ is an intermediate blob that holds the h_{t+1} -> h_{t} gradient. There is a top diff in the backward_cpu() function, at line 209: `Dtype* top_diff = top_.mutable_cpu_diff();`. The top_ blob shares its memory with the actual top blob.
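For reference, a small sketch (not the exact code from the repo) of how an internal blob can alias a top blob's storage, using Caffe's Blob::ShareData/ShareDiff:

```cpp
#include "caffe/blob.hpp"

// After sharing, reading internal.cpu_diff() returns the gradient the
// next layer wrote into top->cpu_diff(), with no copy involved.
void AliasTop(caffe::Blob<float>& internal, caffe::Blob<float>* top) {
  internal.ReshapeLike(*top);
  internal.ShareData(*top);  // forward activations
  internal.ShareDiff(*top);  // backward gradients
}
```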
@junhyukoh Hi, thanks for your reply. One more question: what does clipping_threshold stand for? Is it related to the pre_gate_ gradient?
I also notice you accumulate over the batch: caffe_add(H_, dh_t_1, h_to_h, dh_t_1). Does that mean the h_{t} gradient is composed of the h_{t+1} gradients of all elements in one batch?
@junhyukoh Hi, I am new to Caffe, and I have read your example. I have two questions.

First, in your example TotalLength = seq_length = 320, which means there is only one input sequence. However, if I have more sequences and train for thousands of iterations, then after the first one the clip array turns to all 1s. What does it mean when clip is [1, 1, ...]? Does it continue with another sequence right after the first, marked by a 0 in clip at its head? (I mean this line: train_clip_blob->mutable_cpu_data()[0] = seq_idx > 0;)

Second, it is noted that during the test phase you reshape the input data, which I cannot fully understand; also, there is no input data during the test, is there? Can you explain this, please? (These lines: test_data_blob->Reshape(shape); test_clip_blob->Reshape(shape);)

I'll appreciate your answer, thanks a lot!
@junhyukoh Also, what is the difference between data and label? There is an object named 'data', but it is not mentioned in your code!
Hi @junhyukoh, I have a question about the "clip" array. Say that during the training phase my input "data" is [A B C (eos)] and the desired label is [W X Y Z (eos)]; do "data", "label", and "clip" then become something like this:
| Data | A | B | C | (EOS) | W | X | Y | Z |
|------|---|---|---|-------|---|---|---|---|
| Label | W | X | Y | Z | (EOS) | | | |
| Clip | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
@junhyukoh What is the sequence length if I have a feature blob of shape (10, 50, 4, 4)?
@junhyukoh When training an LSTM with a single (long) repeated sequence and multiple epochs, should the clip value be 0 at the start of each epoch/data sequence, or just the first epoch?
I am in real need of real examples, especially in the vision area :)
Is it possible to feed such input to your network? The sample is so simple, and I am not sure whether such input is doable.