mostafa-saad opened this issue 9 years ago
https://github.com/jeffdonahue/caffe/tree/recurrent would be helpful.
I think it's not different from the example I provided. You can define an LSTM layer on top of the K images in the prototxt. As @nakosung mentioned, Jeff Donahue has another LSTM implementation (which seems likely to be merged into the master branch soon) with examples on images. You can find the prototxt files in his branch. Thanks.
@junhyukoh Thanks so much. One more question: what are the inputs to your LSTM layer? In your example, it takes two elements, data and clip?
@mostafa-saad data is the input to the LSTM. clip is a binary indicator of the continuity of the data (sequence). For example, you can give different input sequences (e.g., [1 2 3 4] and [1 2 3]) as one input as follows:

data = [1 2 3 4 1 2 3], clip = [0 1 1 1 0 1 1]

A "0" marks the head of a sequence. By default, clip is [0 1 1 1 ... 1], which assumes that a single sequence, starting from its head, is given as input. You can also split a very long or variable-length sequence over several forward passes. For example, data = [1 2 3 4 5] can be divided into 5 forward passes:

1. data = [1], clip = [0]
2. data = [2], clip = [1]
3. data = [3], clip = [1]
4. data = [4], clip = [1]
5. data = [5], clip = [1]

Although this seems very inefficient, it is actually necessary, especially when a prediction is used as the input for the next time step (e.g., text modelling).
In your case, I guess you don't have to use "clip", because the input sequence is always complete (starting from its head and continuous). So the default clip values should work for you.
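For concreteness, here is a minimal sketch (not code from the repo) of feeding one long sequence through single-step forward passes, assuming a net whose input blobs are named "data" and "clip" as in lstm_sequence.cpp:

```cpp
#include <vector>
#include "caffe/net.hpp"

// Feed one sequence, one time step per forward pass. Setting clip = 0
// only at the first step tells the LSTM to reset its hidden state there
// and to carry the state forward on every later step.
void FeedSequence(caffe::Net<float>& net, const std::vector<float>& seq) {
  for (size_t t = 0; t < seq.size(); ++t) {
    net.blob_by_name("data")->mutable_cpu_data()[0] = seq[t];
    net.blob_by_name("clip")->mutable_cpu_data()[0] = (t > 0) ? 1.0f : 0.0f;
    net.ForwardPrefilled();  // one time step per pass
  }
}
```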
Thanks. Just to make sure I understand you: suppose I extend the AlexNet architecture with one LSTM layer, and I have 3 training videos, one with 4 frames, another with 3 frames, and a third with 5 frames. Should clip then be: clip = [0 1 1 1 0 1 1 0 1 1 1 1]?
How can I use LevelDB to feed the clip input from the hard disk rather than from memory? Is it possible to just provide a text file?
I am just a novice with Caffe and still learning, sorry for the many questions.
That's correct.
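For illustration, a small hypothetical helper (not from the repo) that builds such a clip vector from per-video frame counts; for videos of 4, 3, and 5 frames it yields [0 1 1 1 0 1 1 0 1 1 1 1], matching the thread above:

```cpp
#include <vector>

// Build the clip indicator for a batch of concatenated videos:
// 0 at the head of each video's sequence, 1 elsewhere.
std::vector<float> MakeClip(const std::vector<int>& frame_counts) {
  std::vector<float> clip;
  for (size_t v = 0; v < frame_counts.size(); ++v) {
    clip.push_back(0.0f);  // head of a new video resets the LSTM state
    for (int t = 1; t < frame_counts[v]; ++t) clip.push_back(1.0f);
  }
  return clip;
}
```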
The current data_layer implementation (src/caffe/layers/data_layer.cpp) doesn't support clip. So you may have to implement your own data layer whose outputs are data/clip if you want to use LevelDB. Another way is to feed data/clip directly from your own program, as in my example code (lstm_sequence.cpp), without using LevelDB; but this doesn't run on a separate thread, so it might be slower than implementing a new data layer.
What about an ImageData input layer with <image, label> pairs, where the images are dummies and the labels carry the binary clip input? Do you think this would work?
I think it would work if you construct the pairs correctly.
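For illustration, a hypothetical list file for such a dummy ImageData layer, where the integer label column carries the clip bit (0 at the head of each video). The file name dummy.jpg is made up here; Caffe's ImageData layer reads lines of "path label":

```
dummy.jpg 0
dummy.jpg 1
dummy.jpg 1
dummy.jpg 0
dummy.jpg 1
```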
Excuse me if this is a very simple question, but I am just starting to learn neural networks with Caffe.
Is it possible to use this network to train on a continuous sequence of 2 variables, e.g., [(2.77, 9.03), (2.01, 10.48), ...], and then predict the next element for a supplied input? For training I could have the sequence [[t0] ... [t9]] (10 time steps) as input and [t10] as the expected output, and then do the prediction in the same manner.
@mecp Yes. It's possible to train the network on multi-dimensional input/output.
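A rough sketch of what one such training example could look like, assuming input blobs named "data" (10 time steps x 2 channels) and "label" (1 x 2) feeding a regression loss such as EuclideanLoss; the actual names and shapes depend on your prototxt:

```cpp
#include <utility>
#include <vector>
#include "caffe/net.hpp"

// Fill one training example: steps t0..t9 as input, t10 as the target.
// series must contain at least 11 (x, y) pairs.
void FillExample(caffe::Net<float>& net,
                 const std::vector<std::pair<float, float> >& series) {
  float* data = net.blob_by_name("data")->mutable_cpu_data();    // 10 x 2
  float* label = net.blob_by_name("label")->mutable_cpu_data();  // 1 x 2
  for (int t = 0; t < 10; ++t) {
    data[2 * t] = series[t].first;       // first variable at step t
    data[2 * t + 1] = series[t].second;  // second variable at step t
  }
  label[0] = series[10].first;   // t10 is the regression target
  label[1] = series[10].second;
}
```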
@junhyukoh what's the difference between the batch size N and the sequence length T?
@HaiboShi In RNN training, a training example is a sequence x_1, x_2, ..., x_T. We can define N such sequences as one mini-batch.
@junhyukoh And the diffs of that mini-batch are summed together to update the weights?
@HaiboShi Yes, the diff is accumulated over the mini-batch. However, loss layers usually give normalized diffs to the bottom blobs (dividing by the size of the mini-batch). So the weight diff is effectively normalized by the mini-batch size.
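In symbols, my reading of the above: with mini-batch size N and per-example losses l_n,

```latex
L = \frac{1}{N} \sum_{n=1}^{N} \ell_n
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial W} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \ell_n}{\partial W}
```

so summing the per-example diffs and normalizing by N are the same thing.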
@junhyukoh Thanks, that helps a lot. Another specific question: in the LSTM layer class, what is the Blob h_to_h_ used for?
@junhyukoh Also, it seems that there is no top diff data in the backward_cpu() function; I wonder how the gradient from the layer above passes into the LSTM layer? Thanks! :100:
@HaiboShi h_to_h_ is an intermediate blob that holds the h_{t+1} -> h_{t} gradient. There is a top diff in the backward_cpu() function, at line 209: `Dtype* top_diff = top_.mutable_cpu_diff();`. The top_ blob shares its memory with the actual top blob.
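For reference, a small sketch (not the exact code from the repo) of how an internal blob can alias a top blob's storage, using Caffe's Blob::ShareData/ShareDiff:

```cpp
#include "caffe/blob.hpp"

// After sharing, reading internal.cpu_diff() returns the gradient the
// next layer wrote into top->cpu_diff(), with no copy involved.
void AliasTop(caffe::Blob<float>& internal, caffe::Blob<float>* top) {
  internal.ReshapeLike(*top);
  internal.ShareData(*top);  // forward activations
  internal.ShareDiff(*top);  // backward gradients
}
```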
@junhyukoh Hi, thanks for your reply. One more question: what does clipping_threshold stand for? Is it related to the pre_gate_ gradient?
I also notice you accumulate over the batch: caffe_add(H_, dh_t_1, h_to_h, dh_t_1). Does that mean the h_{t} gradient is composed of the h_{t+1} gradients of all elements in one batch?
@junhyukoh Hi, I am new to Caffe, and I have read your example. I have two questions.

First, in your example TotalLength = seq_length = 320, which means there is only one input sequence. However, if I have more sequences and train for thousands of iterations, then after the first one the clip array turns to all 1s. What does it mean when clip is [1, 1, ...]? Does it continue with another sequence right after the first, marked by a 0 in clip at its head? (I mean this line: train_clip_blob->mutable_cpu_data()[0] = seq_idx > 0;)

Second, it is noted that during the test phase you reshape the input data, which I cannot fully understand; also, there is no input data during the test, is there? Can you explain this, please? (These lines: test_data_blob->Reshape(shape); test_clip_blob->Reshape(shape);)

I'll appreciate your answer, thanks a lot!
@junhyukoh Also, what is the difference between data and label? There is an object named 'data', but it is not mentioned in your code!
Hi @junhyukoh, I have a question about the "clip" array. Say that during the training phase my input "data" is [A B C (eos)] and the desired label is [W X Y Z (eos)]; do "data", "label", and "clip" then become something like this:
| Data | A | B | C | (EOS) | W | X | Y | Z |
|------|---|---|---|-------|---|---|---|---|
| Label | W | X | Y | Z | (EOS) | | | |
| Clip | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
@junhyukoh What is the sequence length if I have a feature blob of shape (10, 50, 4, 4)?
@junhyukoh When training an LSTM with a single (long) repeated sequence and multiple epochs, should the clip value be 0 at the start of each epoch/data sequence, or just the first epoch?
I am in real need of real examples, especially in the vision area :)
Is it possible to feed such input to your network? The sample is so simple, and I am not sure whether such input is doable.