haihuangcode / CMG

The official implementation of Achieving Cross Modal Generalization with Multimodal Unified Representation (NeurIPS '23)

Train on my own dataset #5

Open chouliuzuo opened 8 months ago

chouliuzuo commented 8 months ago

If I'd like to use CMG on my own dataset (video and audio), how should I prepare the data? I have video-audio pairs; should I extract their features first? If so, which feature extraction models should I use to stay aligned with CMG?

haihuangcode commented 8 months ago

Thank you for your interest in our work. You can refer to the note section of my readme.md: "For the video and audio feature extraction method, please refer to AVE; text is generated from the label as a description-focused statement of approximately 10 words in length." My audio and video feature extraction is consistent with AVE: audio uses VGGish and video uses VGG19. Save the extracted features into a .pkl file, then compress it into a .zip file. The default length for audio and video in the current code is 10; if your clips have a different length, remember to adjust it accordingly. If you have any further questions, feel free to reply to me at any time.
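For anyone following along, here is a minimal sketch of the packaging step described above. It assumes the AVE-style convention of one VGGish vector per second of audio and one VGG19 feature map per second of video; the exact shapes, the clip-id keying scheme, and the file names are assumptions for illustration, not the repo's required layout.

```python
import pickle
import zipfile

import numpy as np

# Hypothetical pre-extracted features for one 10-second clip:
# audio via VGGish (assumed one 128-d vector per second) and video via
# VGG19 (assumed one 7x7x512 feature map per second), following AVE.
audio_feats = np.zeros((10, 128), dtype=np.float32)        # (seconds, vggish dim)
video_feats = np.zeros((10, 7, 7, 512), dtype=np.float32)  # (seconds, H, W, C)

# Save each modality to a .pkl file keyed by clip id (keying scheme assumed).
with open("audio_features.pkl", "wb") as f:
    pickle.dump({"clip_0001": audio_feats}, f)
with open("video_features.pkl", "wb") as f:
    pickle.dump({"clip_0001": video_feats}, f)

# Compress each .pkl into the .zip the data loader expects.
with zipfile.ZipFile("audio_features.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("audio_features.pkl")
with zipfile.ZipFile("video_features.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("video_features.pkl")
```

If your clips are not 10 seconds long, the first dimension of these arrays (and the matching length constant in the code) would need to change together.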

RIU-13 commented 3 months ago

Your work is really great. I want to know how the text features are extracted: bag-of-words or another tokenizer?

haihuangcode commented 3 months ago

> Your work is really great. I want to know how the text features are extracted: bag-of-words or another tokenizer?

In my code, I'm using the BERT model from this source: https://github.com/imgarylai/bert-embedding. You can replace it with another model based on your specific needs.
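A minimal sketch of what that looks like with the linked `bert-embedding` package (installable via `pip install bert-embedding`); the example sentence is an invented label-style description, and stacking the token vectors into one array is my own choice here, not necessarily how the repo consumes them.

```python
from bert_embedding import BertEmbedding  # pip install bert-embedding

import numpy as np

# A short, label-derived description of roughly 10 words, as suggested above.
sentences = ["a dog barks loudly in the backyard at night"]

bert = BertEmbedding()      # defaults to the 12-layer, 768-d uncased BERT
results = bert(sentences)   # one (tokens, embeddings) pair per sentence

tokens, embeddings = results[0]
text_feats = np.stack(embeddings)  # (num_tokens, 768) token-level features
print(tokens, text_feats.shape)
```

Swapping in another encoder (e.g. a Hugging Face BERT) would only require producing a comparable per-token (or pooled) embedding array.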