sjenni / temporal-ssl

Video Representation Learning by Recognizing Temporal Transformations. In ECCV, 2020.
https://sjenni.github.io/temporal-ssl/
GNU General Public License v3.0

Sampling Technique #7

Open Hussein-A-Hassan opened 3 years ago

Hussein-A-Hassan commented 3 years ago

Hello, thank you for sharing your work and knowledge. I am sorry for asking these questions, but I am not familiar with TensorFlow at all.

Please could you clarify the following questions:

1- During downstream-task (action recognition) training, did you sample one clip from each training video using a random starting index? If yes, then at each epoch the total number of training clips would equal the size of the training split.

Or:
Did you use temporal jittering during training? If yes, how many clips did you sample from each training video?
What is the size of one epoch then?

2- During downstream-task evaluation, you mention in the paper that you use all the subsequences of each video in the test split to get the video-level prediction. What if the test video's length is not divisible by the clip length? Then there would be leftover frames that are not enough to sample one more clip. How do you overcome this issue?

For example: when a test video has 173 frames and the clip length is 16 frames, then 10 non-overlapping clips can be sampled, leaving 13 extra frames that are not enough to sample one clip.

Thanks for your help

sjenni commented 3 years ago

Hi, No worries, I'm happy to answer questions.

  1. The size of one epoch is defined by the number of videos in the training set. I apply random temporal cropping during preprocessing.
  2. In the code I actually use a maximum of 32 clips per video (see the parameter num_test_seq=32 in the Preprocessor). All clips are sampled uniformly over the duration of the video (as a result, the clips often overlap).
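
To make the two sampling schemes concrete, here is a minimal NumPy sketch of what the answers describe: random temporal cropping for training, and `num_test_seq=32` clips with starts spread uniformly over the video for testing. The function names, the `clip_len=16` default, and the index-based formulation are illustrative assumptions, not the repo's actual implementation; only `num_test_seq=32` comes from the code discussed above.

```python
import numpy as np

def random_temporal_crop(num_frames, clip_len=16):
    """Training (sketch): one clip per video at a random start index.
    clip_len=16 matches the example in the question; check the repo's
    config for the value actually used."""
    start = np.random.randint(0, max(1, num_frames - clip_len + 1))
    return np.arange(start, start + clip_len)

def uniform_test_clips(num_frames, clip_len=16, num_test_seq=32):
    """Testing (sketch): num_test_seq clips whose start indices are
    spread uniformly over the video. Because starts are real-valued
    positions rounded to ints, clips may overlap, so leftover frames
    (e.g. 173 % 16 = 13) are never a problem."""
    max_start = max(0, num_frames - clip_len)
    starts = np.linspace(0, max_start, num_test_seq).astype(int)
    return [np.arange(s, s + clip_len) for s in starts]

# For the 173-frame example: 32 clips, the first starting at frame 0
# and the last ending at frame 172, with neighbouring clips overlapping.
clips = uniform_test_clips(173)
```

With this scheme every frame of the video can be covered without discarding the trailing remainder, which is why divisibility of the video length by the clip length never comes up.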

Hope that clarifies your questions.