microsoft / M3P

Multitask Multilingual Multimodal Pre-training
MIT License

Problem with datasets and code in fine-tuning on image-text retrieval task #9

Open erfan-ghadery opened 3 years ago

erfan-ghadery commented 3 years ago

Thanks for your nice work. I want to fine-tune the 'Understanding' model on image-text retrieval tasks (Multi30k and MSCOCO), but I can't find these datasets in the format required by your model. I also think there is a problem with fine-tuning the model on the Multi30k dataset: the --cross_rel_steps parameter is set to 'flicker', but load_retrieval_data in loader.py unpacks each entry into two values, src and tgt:

def load_retrieval_data(params, data):
    data['cross_modal'] = {}
    required_cross_modal_train = set(params.cross_rel_steps)  # must need tasks

    for src, tgt in required_cross_modal_train:

        logger.info('============ Cross Modal data (%s-%s)' % (src, tgt))

It would be great if you could provide datasets and other requirements (e.g., image features) with the correct format for running your model. If it's not possible for you, giving a toy example of the input data format would be very helpful. Thanks a lot!

haoyanghua commented 3 years ago

Hi, you can use a script like this to run fine-tuning:

python -m torch.distributed.launch --nproc_per_node=$NGPU ./M3P/train_x.py --data_path $DATA_PATH \
    --reload_model $RELOAD \
    --dump_path $MODELS \
    --exp_name $EXP_NAME \
    --batch_size 24 \
    --emb_dim 768 \
    --n_layers 12 \
    --n_heads 12 \
    --n_dec_layers -1 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation True \
    --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.00005 \
    --lgs $ALL_LGS \
    --data_path $DATA_PATH \
    --vocab_path $VOCAB_PATH \
    --google_path 'google_captions/obj100' \
    --sbu_path 'google_captions/obj100' \
    --coco_path coco \
    --flicker_path flicker \
    --cross_rel_steps flicker-img \
    --mlm_steps '' \
    --epoch_size 150000 \
    --max_epoch 150 \
    --bptt 128 \
    --max_len 64 \
    --fp16 True \
    --validation_metrics valid_I2T_acc,valid_T2I_acc \
    --max_region_num 100 \
    --accumulate_gradients 4 \
    --amp 1 \
    --refine_image False \
    --refine_encoder False \
    --input_fea_dir $FEA_PATH \
    --save_every_epoch 5 \
    --is_generation False \
    --is_understanding True \
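
The command above passes --cross_rel_steps flicker-img rather than flicker, which gives the `for src, tgt in ...` loop in load_retrieval_data a pair to unpack. The sketch below illustrates that idea only. It assumes the argument is a comma-separated list of `src-tgt` items split on the hyphen; `parse_step_pairs` is a hypothetical helper written for this example, not a function taken from the M3P code.

```python
def parse_step_pairs(spec):
    """Split a spec such as 'flicker-img' into a list of (src, tgt) tuples.

    Hypothetical helper: the real argument parsing in M3P may differ.
    """
    pairs = []
    for item in spec.split(','):
        item = item.strip()
        if not item:
            continue  # an empty spec such as '' yields no steps
        parts = item.split('-')
        if len(parts) != 2:
            # A bare 'flicker' has no target side, so it cannot be unpacked
            raise ValueError("expected 'src-tgt', got %r" % item)
        pairs.append((parts[0], parts[1]))
    return pairs


# Mirrors the loop header quoted in the question
for src, tgt in set(parse_step_pairs('flicker-img')):
    print('============ Cross Modal data (%s-%s)' % (src, tgt))
```

Under this assumption, a bare `flicker` value is rejected up front because it carries no second element for `tgt`.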