kennymckormick / TransRank

[CVPR2022 Oral] The official code for "TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition"

RGBDiff #6

Open ChicyChen opened 11 months ago

ChicyChen commented 11 months ago

Hi, when you extract RGBDiff, do you subtract frame t from frame t+1 without downsampling and then downsample, or do you downsample first and then take the difference? And do you apply augmentation before extracting the difference, or extract the difference first and then augment? Also, for the 2-stream case using both RGB and RGBDiff, do you train two 3D encoders separately and then test with both by averaging the predictions? Thank you very much!

kennymckormick commented 11 months ago

Hi, ChicyChen,

  1. Yes, RGBDiff is obtained by subtracting frame t from frame t+1. I'm not sure what you mean by downsampling here; if that means downsampling the original video to 112x112 or 224x224, the answer is: we first do the downsampling, and then calculate the RGBDiff.
  2. All augmentations are conducted before we extract the difference (see the sketch after this list).
  3. Yes, we train two 3D encoders separately, finetune them on the downstream tasks separately, and average the predictions to obtain the two-stream accuracy.
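A minimal PyTorch sketch of the ordering in points 1 and 2, spatial downsampling (plus any augmentation) first, temporal difference last. The function name and the plain `F.interpolate` resize are illustrative, not from the TransRank codebase; real training would use the full augmentation pipeline in place of the resize:

```python
import torch
import torch.nn.functional as F

def prepare_rgbdiff(raw_frames: torch.Tensor, size: int = 112) -> torch.Tensor:
    """Illustrative order: downsample/augment every frame first, diff last.

    raw_frames: (T, C, H, W) float clip at original resolution.
    Returns: (T - 1, C, size, size) RGBDiff clip.
    """
    # Step 1-2: downsample all frames to the training resolution
    # (crops, flips, and other augmentations would also happen here).
    frames = F.interpolate(raw_frames, size=(size, size),
                           mode='bilinear', align_corners=False)
    # Step 3: RGBDiff = frame t+1 minus frame t.
    return frames[1:] - frames[:-1]
```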
ChicyChen commented 11 months ago

Thank you! By downsampling by n, I mean selecting 1 frame every n frames. Do you first select those frames and then calculate RGBDiff?

kennymckormick commented 11 months ago

Hi, Chicy, for the case of sampling t frames from a clip of length T (t < T), we first perform data augmentation on all T frames, then calculate the RGBDiff to obtain T - 1 frames, and finally sample t frames from those T - 1 frames.
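A minimal sketch of that order, assuming the T frames have already been augmented. The uniform index sampling is only an illustration; the actual frame-sampling strategy in the codebase may differ:

```python
import torch

def extract_rgbdiff_clip(frames: torch.Tensor, t: int) -> torch.Tensor:
    """frames: (T, C, H, W) augmented clip; requires t <= T - 1.

    Order: augment all T frames (done by the caller), take RGBDiff
    to get T - 1 frames, then sample t of them.
    """
    # RGBDiff over the full augmented clip: (T - 1, C, H, W).
    diff = frames[1:].float() - frames[:-1].float()
    # Sample t indices from the T - 1 difference frames
    # (uniform spacing here; purely illustrative).
    idx = torch.linspace(0, diff.size(0) - 1, steps=t).round().long()
    return diff[idx]
```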

ChicyChen commented 11 months ago

Thank you! How about the results in Table 12? Do you also do self-supervised training on two encoders for RGB and RGBDiff, and average the similarities of the two modalities for video retrieval?

kennymckormick commented 11 months ago

Yes, for the two-stream retrieval results, we also pretrain two models (RGB & RGBDiff) and average the similarities of the two modalities.
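For reference, a hedged sketch of how that fusion could look, assuming L2-normalized features and cosine similarity (the function and argument names are hypothetical, not from the released code):

```python
import torch
import torch.nn.functional as F

def two_stream_retrieval(rgb_q, rgb_g, diff_q, diff_g, k=1):
    """Average per-modality cosine-similarity matrices, then retrieve
    the top-k gallery clips per query clip.

    rgb_q/diff_q: (Nq, D) query features, rgb_g/diff_g: (Ng, D) gallery
    features, from the separately pretrained RGB and RGBDiff encoders.
    """
    # Cosine similarity per modality (assumption; any similarity works).
    sim_rgb = F.normalize(rgb_q, dim=1) @ F.normalize(rgb_g, dim=1).T
    sim_diff = F.normalize(diff_q, dim=1) @ F.normalize(diff_g, dim=1).T
    # Fuse modalities by averaging the similarity matrices.
    sim = (sim_rgb + sim_diff) / 2
    return sim.topk(k, dim=1).indices  # (Nq, k) retrieved gallery indices
```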