salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method

change english text_encoder to other language? #117

Open jammyWolf opened 1 year ago

jammyWolf commented 1 year ago

Hello author, thanks for the great work! I want to use ALBEF to train a vision-language multimodal model for another language, but I am a little confused about the fine-tuning procedure.

Here are the steps I took:

  1. Load the .pth checkpoint from your repo and iterate over its parameters.
  2. Load the parameters of the BERT model bert-base-chinese into the ALBEF tensors whose names contain text_encoder.
  3. Freeze the parameters of the ALBEF model whose names contain visual_encoder.

The code looks like this:

```python
import torch
from transformers import BertTokenizer

from models.model_retrieval import ALBEF  # or models.model_pretrain, depending on the task

tokenizer = BertTokenizer.from_pretrained(args.text_encoder)  # load the Chinese BERT tokenizer (bert-base-chinese)
model = ALBEF(config=config, text_encoder=args.text_encoder, tokenizer=tokenizer)
model_dict = model.state_dict()

# Load the parameters from your checkpoint, but leave out tensors whose names
# contain "text_encoder" (those keep the bert-base-chinese initialization).
temp = {}
pretrained_dict = torch.load(args.checkpoint, map_location='cpu')['model']
for k, v in pretrained_dict.items():
    if "text_encoder" not in k and k in model_dict and model_dict[k].shape == v.shape:
        temp[k] = v

# Merge the checkpoint tensors into the freshly initialized state dict.
model_dict.update(temp)
model.load_state_dict(model_dict)

# Freeze the visual encoder. requires_grad has to be set on the model's own
# parameters; setting it on the loaded checkpoint tensors has no effect.
for name, param in model.named_parameters():
    if "visual_encoder" in name:
        param.requires_grad = False
```

In the end I got a poor recall score on the Flickr-CN dataset. Could you give me some advice?

LiJunnan1992 commented 1 year ago

Hi, it won't work if you directly replace bert-en with bert-cn, as the parameters of the two models are different. ALBEF is pre-trained with bert-en and cannot be directly applied to bert-cn.
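
One quick way to see part of the problem (a minimal sketch, not from this thread; the model names are the standard Hugging Face checkpoints): the English and Chinese BERT bases use different tokenizers and vocabulary sizes, so even their word-embedding matrices have incompatible shapes, and the rest of the English text-encoder weights have been further changed by ALBEF pre-training.

```python
from transformers import BertModel

# Compare the word-embedding shapes of the English and Chinese BERT bases.
bert_en = BertModel.from_pretrained("bert-base-uncased")
bert_cn = BertModel.from_pretrained("bert-base-chinese")

print(bert_en.embeddings.word_embeddings.weight.shape)  # torch.Size([30522, 768])
print(bert_cn.embeddings.word_embeddings.weight.shape)  # torch.Size([21128, 768])
```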

ozanciga commented 1 year ago

Hey @LiJunnan1992, what is your opinion on using adapters to update the pre-trained model? Or something like low-rank adaptation (LoRA)? I'm wondering whether such partial training can be applied to align these models. It seems possible, but I would like an expert opinion, since the setup may be costly in time and compute. Thank you!
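
For reference, a minimal sketch of the LoRA idea mentioned above (not part of the ALBEF codebase; the wrapper class and the module path in the usage comment are illustrative assumptions): freeze a pre-trained linear layer and learn only a low-rank additive update.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: y = Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: training starts at the base model
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

# Hypothetical usage: wrap the query/value projections inside a BERT-style text encoder, e.g.
#   layer.attention.self.query = LoRALinear(layer.attention.self.query, rank=8)
```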