Open ChicyChen opened 11 months ago
Hi, ChicyChen,
Thank you! Downsampling by n means selecting 1 frame out of every n frames. Do you first select those frames and then calculate RGBDiff?
Hi Chicy, for the case of sampling t frames from a clip of length T (t < T), we first apply data augmentation to all T frames, then calculate the RGBDiff between consecutive frames to obtain T - 1 difference frames, and finally sample t frames from those T - 1 frames.
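For readers following along, here is a minimal NumPy sketch of that order of operations (augment, then difference, then sample). It is not the repository's code: the function name `rgbdiff_then_sample`, the `augment` callback, and the evenly spaced sampling are all assumptions made for illustration, since the reply does not say how the t frames are chosen.

```python
import numpy as np

def rgbdiff_then_sample(clip, t, augment=None):
    """Hypothetical helper: clip is a (T, H, W, C) array, returns (t, H, W, C).

    Order follows the reply above: augment all T frames, take
    consecutive-frame differences (T - 1 frames), then sample t of them.
    """
    if augment is not None:
        # Assumption: the same augmentation is applied to the whole clip.
        clip = augment(clip)
    clip = np.asarray(clip, dtype=np.float32)
    # Frame at i+1 minus frame at i, giving T - 1 difference frames.
    diff = clip[1:] - clip[:-1]
    if not 1 <= t <= len(diff):
        raise ValueError("t must satisfy 1 <= t <= T - 1")
    # Assumption: evenly spaced indices; the actual sampling scheme is not stated.
    idx = np.linspace(0, len(diff) - 1, num=t).round().astype(int)
    return diff[idx]

# Usage: an 8-frame toy clip, horizontally flipped, reduced to 4 difference frames.
toy_clip = np.arange(8 * 2 * 2 * 3).reshape(8, 2, 2, 3)
out = rgbdiff_then_sample(toy_clip, t=4, augment=lambda c: c[:, :, ::-1])
print(out.shape)
```

Because the differences are taken before sampling, each output frame is always the difference of two adjacent original frames, regardless of how sparse the final sampling is.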
Thank you! How about the results in Table 12? Do you also do self-supervised training of two encoders, one for RGB and one for RGBDiff, and average the similarity of the two modalities to do video retrieval?
Yes, for the two-stream retrieval results, we also pretrained two models (RGB and RGBDiff) and averaged the similarity of the two modalities.
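As a rough illustration of the averaging step, the sketch below ranks gallery videos by the mean of the per-modality similarities. It assumes cosine similarity on L2-normalized features and equal weighting of the two streams; neither detail is confirmed in this thread, and the names `l2_normalize` and `two_stream_retrieval` are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def two_stream_retrieval(q_rgb, q_diff, g_rgb, g_diff, k=1):
    """Hypothetical helper: return top-k gallery indices for each query.

    q_* are (num_queries, dim) features and g_* are (num_gallery, dim)
    features from the separately pretrained RGB and RGBDiff encoders.
    """
    sim_rgb = l2_normalize(q_rgb) @ l2_normalize(g_rgb).T
    sim_diff = l2_normalize(q_diff) @ l2_normalize(g_diff).T
    # Assumption: a plain, equally weighted average of the two similarity matrices.
    sim = 0.5 * (sim_rgb + sim_diff)
    # Sort by descending similarity and keep the k best matches.
    return np.argsort(-sim, axis=1)[:, :k]

# Usage with toy features: the query matches gallery item 2 in both streams.
g_rgb = np.eye(3)
g_diff = np.eye(3)[::-1]
top1 = two_stream_retrieval(g_rgb[2:3], g_diff[2:3], g_rgb, g_diff, k=1)
print(top1)
```

The two feature spaces never need to share a dimension here, since only the similarity matrices are combined.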
Hi, when you extract RGBDiff, do you subtract the frame at t from the frame at t+1 without downsampling and then downsample, or do you first downsample and then extract the difference? And do you first do augmentation and then extract the difference, or first extract the difference and then do augmentation? Also, for the 2-stream case using both RGB and RGBDiff, do you train two 3D encoders separately and then test with both by averaging the predicted accuracy? Thank you very much!