jayleicn / ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
https://arxiv.org/abs/2102.06183
MIT License

Fine-tuning ClipBERT on custom datasets #42

Closed awkrail closed 2 years ago

awkrail commented 2 years ago

Hi, thank you for sharing this interesting work!

I would like to try fine-tuning ClipBERT on other video-and-language datasets, such as YouCook2. My target downstream task is sentence-level cross-modal retrieval, rather than paragraph-level retrieval.

Do you have any recommendations for training ClipBERT on custom datasets? In particular, I am curious about how to choose the hyper-parameters specified in the config files for other datasets. Thank you.

jayleicn commented 2 years ago

Hi @misogil0116, thanks for your interest in our work! You can start with one of our retrieval tasks, such as MSRVTT retrieval, and then adapt it to the dataset of interest.
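
For concreteness, here is a minimal sketch of how one might derive a config for a custom dataset from the MSRVTT retrieval config. The config path and the key names (train_datasets, val_datasets, txt, img) are assumptions based on the repo layout, so check the actual MSRVTT retrieval config for the authoritative schema; the hyper-parameter values are placeholders to sweep, not recommendations.

import json

# Hypothetical adaptation: start from the MSRVTT retrieval config shipped with
# the repo and point it at a custom dataset (YouCook2 here).
with open("src/configs/msrvtt_ret_base_resnet50.json") as f:
    cfg = json.load(f)

# Assumption: each split holds a list of dataset entries with txt/img db paths.
for split in ("train_datasets", "val_datasets"):
    for dset in cfg.get(split, []):
        dset["name"] = "youcook2"
        dset["txt"] = "txt_db/youcook2_retrieval"  # sentence-level annotations
        dset["img"] = "vis_db/youcook2"            # video LMDB

# Typical knobs to re-tune on a new dataset; values here are placeholders.
cfg["train_batch_size"] = 32
cfg["learning_rate"] = 5e-5
cfg["num_train_epochs"] = 10

with open("src/configs/youcook2_ret_base_resnet50.json", "w") as f:
    json.dump(cfg, f, indent=2)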

awkrail commented 2 years ago

@jayleicn Hi, thank you for your quick response! I have another question, about train.jsonl. When loading this file, each element in the array has three keys: caption, clip_name, and sen_id. I assume that clip_name should be the same as the key in the LMDB file, which stores the video binary files. Is this correct?

In [2]: import json

In [3]: with open("/mnt/LSTA6/data/nishimura/misc/clipbert/txt_db/msrvtt_retrieval/train.jsonl") as f:
   ...:     train_data = [json.loads(l.strip("\n")) for l in f.readlines()]
   ...:

In [4]: train_data[0]
Out[4]:
{'caption': 'a cartoon animals runs through an ice cave in a video game',
 'clip_name': 'video2960',
 'sen_id': 0}

jayleicn commented 2 years ago

Yes, you are correct!
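
For reference, a minimal sketch (not from this thread) of a consistency check between train.jsonl and the video LMDB. It assumes the LMDB keys are the clip names encoded as UTF-8 bytes, and the YouCook2-style paths are placeholders to replace with real ones.

import json
import lmdb

TXT_JSONL = "txt_db/youcook2_retrieval/train.jsonl"  # placeholder path
VID_LMDB = "vis_db/youcook2"                          # placeholder path

# One JSON object per line, as in the MSRVTT example above.
with open(TXT_JSONL) as f:
    annotations = [json.loads(line) for line in f]

# Assumption: LMDB keys are the clip names encoded as UTF-8 bytes.
env = lmdb.open(VID_LMDB, readonly=True, lock=False)
missing = []
with env.begin() as txn:
    for ann in annotations:
        if txn.get(ann["clip_name"].encode("utf-8")) is None:
            missing.append(ann["clip_name"])

print(f"{len(missing)} of {len(annotations)} clips have no LMDB entry")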

awkrail commented 2 years ago

Thank you!