Video compression/decoding methods of each dataset in CLIP-ViP

microsoft / XPretrain

Multi-modality pre-training

Video compression/decoding methods of each dataset in CLIP-ViP #17

Closed: fadzaka12 closed this issue 1 year ago

fadzaka12 commented 1 year ago

Hi, I'm trying to reproduce the CLIP-ViP results. The README says that the data preprocessing step follows HD-VILA. However, in the configuration files of the downstream tasks, the compression/decoding method seems to differ from that. Are the following video preprocessing methods correct?

MSR-VTT: compression, 6 FPS
LSMDC: no compression/decoding, use raw video as is
ActivityNet: decoding lr
DiDeMo: compression, X FPS (What is the number of X? Is it 6 too?)

HellwayXue commented 1 year ago

Hi, the preprocessing methods you listed are right. For DiDeMo, we keep 32 frames for each video, so the FPS is variable: it depends on the length of the video.
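
To illustrate why a fixed frame count gives a variable FPS, here is a minimal sketch. It is not taken from the CLIP-ViP or HD-VILA code: the helper names `sample_frame_indices` and `effective_fps` are hypothetical, and sampling evenly from segment midpoints is an assumption about how the 32 frames could be chosen.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int = 32) -> List[int]:
    """Pick num_samples frame indices spread evenly over a video.

    Hypothetical helper: assumes uniform sampling, which the thread
    does not confirm.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Split the video into num_samples equal segments and take the
    # frame at the midpoint of each segment, clamped to the last frame.
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_samples))
        for i in range(num_samples)
    ]


def effective_fps(duration_seconds: float, num_samples: int = 32) -> float:
    """Frames kept per second of video when num_samples frames are kept."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return num_samples / duration_seconds


if __name__ == "__main__":
    # Keeping the same 32 frames yields a different FPS per clip length.
    for duration in (16.0, 32.0, 64.0):
        print(duration, effective_fps(duration))
```

Under these assumptions a 16-second clip ends up at 2 FPS and a 64-second clip at 0.5 FPS, which is why no single X can be given for DiDeMo.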