v-iashin / MDVC

PyTorch implementation of Multi-modal Dense Video Captioning (CVPR 2020 Workshops)
https://v-iashin.github.io/mdvc

Hello, author. May I ask whether it is necessary to distinguish between training-set and test-set features when extracting the multi-modal features? #25

Open LiMxStar opened 2 years ago

LiMxStar commented 2 years ago

I noticed that all the features extracted by your code end up in a single file: the features for every video are written directly into one HDF5 file, without distinguishing between the training set and the test set. I hope you can spare some of your valuable time to answer this question.

v-iashin commented 2 years ago

Hi, issue starter!

It is easier than you think. The features are extracted for the whole video regardless of the dataset split. During training, we simply trim the feature stack according to the start and end timestamps of each event. At test time, you can download the predictions of the proposal generator from BAFCG.
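
As a rough illustration of the trimming step, here is a minimal sketch (not the repository's actual data-loading code). It assumes the HDF5 file maps each video id directly to a `(T, D)` feature array and that the temporal rate of the features, `feats_per_sec`, is known; both are assumptions for illustration.

```python
import h5py
import numpy as np

# Minimal sketch: trim a whole-video feature stack to one event segment.
# Assumptions (illustrative, not the repo's exact layout): the HDF5 file maps
# video ids to (T, D) feature arrays, and features were extracted at a known
# temporal rate `feats_per_sec`.
def trim_features(h5_path, video_id, start_sec, end_sec, feats_per_sec):
    with h5py.File(h5_path, 'r') as f:
        feats = f[video_id][...]  # (T, D) features for the whole video
    start_idx = max(int(np.floor(start_sec * feats_per_sec)), 0)
    end_idx = min(int(np.ceil(end_sec * feats_per_sec)), len(feats))
    return feats[start_idx:end_idx]  # segment-level feature stack
```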

Note, the code might not be adapted for the test set; only for the train set and the two validation sets.

How to distinguish the modalities? Well, we have separate files for the audio, speech, and visual features. Each of these is uni-modal.
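
For reference, here is a hedged sketch of how such per-modality files could be queried for a single video. The file names and the id-keyed HDF5 layout are illustrative assumptions, not MDVC's documented layout (e.g., the speech data may well be stored in a different format).

```python
import h5py

# Hedged sketch: read the separate uni-modal feature files for one video.
# File names and the id-to-dataset layout are assumptions for illustration only.
paths = {
    'audio': 'data/audio_features.hdf5',
    'visual': 'data/visual_features.hdf5',
    'speech': 'data/speech_features.hdf5',
}

video_id = 'v_example'  # hypothetical video id
features = {}
for modality, path in paths.items():
    with h5py.File(path, 'r') as f:
        if video_id in f:
            features[modality] = f[video_id][...]

for modality, feats in features.items():
    print(modality, feats.shape)
```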