This repo contains the codebase of a series of research projects focused on adapting vision-language models like CLIP to downstream datasets via multitask prompt learning:
This code is built on top of the Dassl.pytorch toolbox and CoOp, so you need to set up the `dassl` and PyTorch environment first. After that, run `pip install -r requirements.txt` under `MVLPT/` to install a few more packages required by CLIP (this should be done with the `dassl` environment activated). Then you are ready to go.
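A minimal setup sketch, assuming Dassl.pytorch was installed into a conda environment named `dassl` (the environment name and use of conda are assumptions, not requirements of this repo):

```bash
# Activate the environment where Dassl.pytorch and PyTorch are installed
# (assumed here to be a conda env named "dassl").
conda activate dassl

# Install the extra packages required by CLIP from the MVLPT directory.
cd MVLPT/
pip install -r requirements.txt
```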
Follow `DATASETS.md` to install the datasets from CoOp for multitask source prompt initialization, or run the following script after installing `gdown`:

```bash
bash scripts/data.sh
```
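If you take the script route, `gdown` is available on PyPI, so a typical sequence (assuming pip in the active environment) would be:

```bash
# Install the Google Drive downloader used by the data script,
# then fetch the source datasets.
pip install gdown
bash scripts/data.sh
```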
Note that the datasets for the target ELEVATER benchmark will be downloaded automatically to `MVLPT/trainers/vision_benchmark/`.
Click a paper below to see the detailed instructions on how to run the code to reproduce the results.
To load a trained model for evaluation, specify `--model-dir` and `--load-epoch` (see this script for an example).
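A hedged sketch of such an evaluation command, based on the CoOp-style `train.py` interface that this code builds on; the config file paths, trainer name, output directories, and epoch number below are illustrative assumptions, not values taken from this repo's scripts:

```bash
# Load a trained checkpoint (--model-dir, --load-epoch) and run evaluation only.
# All paths, config files, and the epoch number are placeholders.
python train.py \
    --root /path/to/datasets \
    --trainer MVLPT \
    --dataset-config-file configs/datasets/caltech101.yaml \
    --config-file configs/trainers/MVLPT/vit_b16.yaml \
    --output-dir output/eval/caltech101 \
    --model-dir output/train/multitask \
    --load-epoch 20 \
    --eval-only
```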
If you use this code in your research, please cite the following paper:
```bibtex
@article{shen2022mvlpt,
  title={Multitask Vision-Language Prompt Tuning},
  author={Shen, Sheng and Yang, Shijia and Zhang, Tianjun and Zhai, Bohan and Gonzalez, Joseph E. and Keutzer, Kurt and Darrell, Trevor},
  journal={arXiv preprint arXiv:2211.11720},
  year={2022}
}
```