Mr. Right is a novel retrieval dataset containing multimodal documents (images and text) paired with multiple types of related queries. It also provides a multimodal framework for evaluation and compares against previous text-to-text and image-text retrieval models. The dataset and model checkpoints are publicly released.
For more details, please check out our Mr. Right paper.
conda create --name multimodal python=3.8 pandas numpy
conda activate multimodal
pip install -r requirements.txt
wandb login
bash download_dataset.sh
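After downloading, you can sanity-check the dataset with a short script. This is only an illustrative sketch: the paths assume the data lands under data/, and the printed fields depend on the released JSON schema, which may differ from what is shown here.

import json

# Illustrative sanity check; file names follow download_dataset.sh,
# but verify the actual paths and JSON schema after downloading.
with open("data/multimodal_documents.json", "r") as f:
    documents = json.load(f)
with open("data/multimodal_val_queries.json", "r") as f:
    val_queries = json.load(f)

print(f"{len(documents)} documents, {len(val_queries)} validation queries")
print(documents[0])    # inspect one multimodal document entry
print(val_queries[0])  # inspect one query entry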
We build our models on ALBEF, METER, and ViLT. Download the checkpoints:
bash ./checkpoints/download_checkpoints.sh
# dir root: data
python extract_multimodal_val.py --mul_doc multimodal_documents.json \
--mul_val multimodal_val_queries.json \
--val_amount 10000 \
--output multimodal_val_documents.json
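This step carves a smaller validation document pool (here 10,000 documents) out of the full corpus so validation retrieval does not have to score against every document. A quick check of the output (a hedged sketch; the exact JSON schema is an assumption):

import json

# Confirm the extracted validation pool has the expected size.
# The path assumes the command above was run from the data/ directory.
with open("multimodal_val_documents.json", "r") as f:
    val_docs = json.load(f)
assert len(val_docs) == 10000, f"expected 10000 docs, got {len(val_docs)}"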
CUDA_VISIBLE_DEVICES=0 python main.py \
--num_gpus [number of gpus] \
--num_workers [number of workers] \
--wandb_task_name [Name of task] \
--batch_size 16 \
--pretrain [ALBEF | ViLT | METER] \
--embeds_feats [avg | cls] \
--pl_checkpoint [path for resumed model] \
--save_checkpoint [path for saving checkpoints] \
--neg_matching \
--ctx_prediction \
--re_ranking
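The last three flags toggle additional training objectives. As a rough illustration of the kind of query-document matching loss such a retriever trains with (this is a generic in-batch contrastive sketch, not the actual main.py implementation; the embedding shapes and temperature are assumptions):

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.07):
    """Generic in-batch contrastive loss for query-document retrieval.

    query_emb, doc_emb: [batch_size, dim] embeddings where row i of each
    tensor is a matched query-document pair; every other row serves as an
    in-batch negative. Illustrative sketch only, not Mr. Right's code.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric loss: query-to-doc and doc-to-query directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2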
We evaluate our models on a single V100 32GB GPU. However, computing the TR, IR, and MR scores simultaneously exceeds the available memory, so we store the embeddings to pickle files and compute the scores separately.
# Run model
CUDA_VISIBLE_DEVICES=0 python main.py \
--num_gpus 1 \
--mode test \
--wandb_task_name [Name of task] \
--pickle_output [Directory of testing pickle files] \
--test_output [Json results of model] \
--batch_size 128 \
--pretrain [ALBEF | ViLT | METER] \
--pl_checkpoint checkpoints/[albef.ckpt | vilt.ckpt | meter.ckpt]
# Calculate the score
python compute_pickle.py \
--pickle_input [Embeddings of different retrieval tasks]
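compute_pickle.py reads the stored embedding pickles and computes the retrieval scores. The sketch below shows the general idea, recall@k from a query-document similarity matrix built from saved embeddings; the pickle layout, keys, and paths are assumptions, not the script's actual interface.

import pickle
import numpy as np

def recall_at_k(query_emb, doc_emb, gt_doc_idx, k=10):
    """Fraction of queries whose ground-truth document ranks in the top-k.

    query_emb: [num_queries, dim], doc_emb: [num_docs, dim] (L2-normalized),
    gt_doc_idx: [num_queries] index of the relevant document per query.
    Illustrative only; the real compute_pickle.py may differ.
    """
    sims = query_emb @ doc_emb.T                 # cosine similarity
    topk = np.argsort(-sims, axis=1)[:, :k]      # top-k document indices
    hits = (topk == gt_doc_idx[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical pickle layout: adjust keys/paths to the files main.py wrote.
with open("pickles/text_retrieval.pkl", "rb") as f:
    emb = pickle.load(f)
print("TR R@10:", recall_at_k(emb["query_emb"], emb["doc_emb"], emb["gt_idx"]))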
This data is available under the Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) license.
For any questions, please contact r09944010@ntu.edu.tw or c2hsieh@ucsd.edu.