Open nandiya opened 6 years ago
Hi @nandiya, have you figured out the answer? If so, please share it here. Thank you.
You're confusing 20 channels with 20 frames.
At time t, the current frame is sent into the spatial stream as it is. The optical flow between consecutive frames from t to t+10 is computed, and the 10 resulting flow fields are stacked together as 20 channels (10*2 for the x and y axes). This 20-channel input is used for the temporal stream. Each stream produces a class score at each t, and the scores are fused. The final video score is obtained by averaging over all frame scores.
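To make the channel bookkeeping concrete, here is a rough sketch in plain Python. The function names (`flow_stack_layout`, `valid_start_frames`, `average_scores`) are made up for illustration and are not from the repo, and the interleaved x/y channel order is an assumption; the actual code may order the channels differently.

```python
def flow_stack_layout(t, stack_len=10):
    """List which (frame_a, frame_b, axis) flow component sits in each channel.

    Assumes x and y components are interleaved: x0, y0, x1, y1, ...
    """
    channels = []
    for k in range(stack_len):
        for axis in ("x", "y"):
            # Flow k is computed between frame t+k and frame t+k+1.
            channels.append((t + k, t + k + 1, axis))
    return channels


def valid_start_frames(num_frames, stack_len=10):
    """Start positions t for which frames t .. t+stack_len all exist."""
    # A stack of stack_len flows needs stack_len + 1 frames.
    return range(0, num_frames - stack_len)


def average_scores(per_frame_scores):
    """Average per-position class scores into one video-level score."""
    count = len(per_frame_scores)
    num_classes = len(per_frame_scores[0])
    return [sum(s[c] for s in per_frame_scores) / count
            for c in range(num_classes)]


layout = flow_stack_layout(t=0)
print(len(layout))                  # number of channels in one temporal input
print(len(valid_start_frames(30)))  # positions available in a 30-frame video
print(len(valid_start_frames(15)))  # positions available in a 15-frame video
```

Under these assumptions, a longer video simply offers more start positions t, and each position still yields one 20-channel input, so the extra frames are not wasted.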
Have a look at this paper: https://arxiv.org/pdf/1406.2199v2.pdf
Basically, we stack 10 consecutive optical flow images to form a single input with 10*2 channels (x, y). If a video has fewer than 10 frames, then we discard that video, IMO. Can you confirm, @stillbreeze?
There's no video with <10 frames. Even at 30 fps, 10 frames is just a 0.33 s video!
I modified the code a little bit. The wadhwasahil code takes the optical flow (x and y) every 5 frames of a video, and I still couldn't figure out how to handle videos of different lengths. So I changed it for my thesis proposal, since my videos vary in length. I split each video into frames without caring about its length. Let's say I have 4 videos with lengths 3 s, 3 s, 4 s and 5 s; they could generate 124, 127, 143 and 150 frames. If I want to take 20 optical flow samples (x and y) from each video, it looks like this:
n = total_frames % 20  (total_frames is the frame count of one video; 20 because I want 20 samples)
for j in range(1, total_frames - n, round((total_frames - n) / 20)):
    # take the optical flow here (more or less the same as the wadhwasahil code; mine is a little longer because of some problems with my OpenCV and PIL)
Explanation:
124 % 20 = 4 --> 120 / (120 / 20) --> will get 20 optical flows (x and y)
143 % 20 = 3 --> 140 / (140 / 20) --> will get 20 optical flows (x and y)
If it's still unclear, try working through the arithmetic yourself.
That way I get the same number of optical flows (x and y) for every video, and I don't have to care about the different lengths. Next, I just need to feed the optical flow results to the CNN ^^.
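For anyone who wants to check the sampling arithmetic described above, here is a small runnable sketch. The function name `sample_positions` and its `start` parameter are hypothetical, added for illustration only; the default `start=1` mirrors the loop as posted. It only computes the frame positions and does not compute any optical flow.

```python
def sample_positions(total_frames, num_samples=20, start=1):
    """Return evenly spaced frame positions for one video.

    Drops the remainder so the usable length divides evenly by
    num_samples, then steps through it with a fixed stride.
    """
    if total_frames < num_samples:
        # The stride would be 0 here, so such videos need separate handling.
        raise ValueError("video has fewer frames than requested samples")
    remainder = total_frames % num_samples
    usable = total_frames - remainder
    step = usable // num_samples  # exact, since usable is a multiple of num_samples
    return list(range(start, usable, step))


for total in (124, 127, 143, 150):
    positions = sample_positions(total)
    print(total, len(positions), positions[:3])
```

One caveat with this sketch: when the stride works out to 1 (videos with 20 to 39 frames), starting at position 1 gives only 19 positions. Passing `start=0` gives exactly 20 in that case as well.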
Sorry, I still don't quite understand the temporal stream code. UCF videos differ in length, so they produce different numbers of frames; let's say video 1 produces 30 frames while video 2 produces 15 frames. But it seems that in the temporal stream code you just take 10 optical flow frames, which means only 20 frames. Does that mean the rest of the frames are useless? What about a video that generates fewer than 20 frames?