How can I fine-tune for the image-text retrieval task?

python oscar/run_retrieval.py \
    --model_name_or_path vinvl/coco_ir/base/checkpoint-1340000 \
    --do_train \
    --do_lower_case \
    --evaluate_during_training \
    --num_captions_per_img_val 20 \
    --eval_caption_index_file minival_caption_indexs_top20.pt \
    --per_gpu_train_batch_size 16 \
    --learning_rate 0.00002 \
    --num_train_epochs 30 \
    --weight_decay 0.05 \
    --save_steps 5000 \
    --add_od_labels \
    --od_label_type vg \
    --max_seq_length 70 \
    --max_img_seq_length 70 \
    --output_dir output/

I ran the command above on a single GPU, but training did not converge. How should I set the number of epochs, per_gpu_train_batch_size, and the learning rate if I can use at most three GPUs? Thanks for your help!
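To make the question concrete, here is a rough sketch of how the effective batch size changes between one and three GPUs, and what a linearly scaled learning rate would be. This is only an illustration. The helper names are hypothetical, the assumption that the script multiplies the per-GPU batch size by the number of GPUs is not confirmed by this thread, and the linear scaling rule is a general heuristic rather than a setting recommended by the Oscar authors.

```python
def effective_batch_size(per_gpu_batch, n_gpus, grad_accum_steps=1):
    """Samples contributing to one optimizer step.

    Assumes (not confirmed for this script) that the total batch is
    per-GPU batch x number of GPUs x gradient accumulation steps.
    """
    return per_gpu_batch * max(1, n_gpus) * grad_accum_steps


def linearly_scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling heuristic: scale the learning rate with the batch size."""
    return base_lr * new_batch / base_batch


# Values taken from the command above: 16 samples per GPU, learning rate 2e-5.
one_gpu_batch = effective_batch_size(16, n_gpus=1)
three_gpu_batch = effective_batch_size(16, n_gpus=3)

# Hypothetical starting point for three GPUs, treating the one-GPU run as the baseline.
three_gpu_lr = linearly_scaled_lr(2e-5, one_gpu_batch, three_gpu_batch)

print(one_gpu_batch, three_gpu_batch, three_gpu_lr)
```

Under these assumptions, three GPUs give an effective batch of 48 instead of 16. Whether a proportionally larger learning rate actually helps convergence here would still need to be checked empirically.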

microsoft/Oscar, issue #188