salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License
1.46k stars 193 forks

Questions about the epochs of pre-training? #85

Closed Fly2flies closed 2 years ago

Fly2flies commented 2 years ago

Hi, thanks for sharing such a great codebase. I tried ALBEF on my large-scale (4M) noisy image-text pair dataset, only adding a new unimodal contrastive loss. During pre-training, I noticed that each loss value decreased smoothly (4 RTX 3090 GPUs, batch size 32):

{"train_lr": "0.000", "train_loss_mlm": "2.678", "train_loss_ita": "2.500", "train_loss_itm": "0.456", "epoch": 0}
{"train_lr": "0.000", "train_loss_mlm": "1.789", "train_loss_ita": "2.478", "train_loss_itm": "0.360", "epoch": 1}
{"train_lr": "0.000", "train_loss_mlm": "1.694", "train_loss_ita": "2.373", "train_loss_itm": "0.327", "epoch": 2}
{"train_lr": "0.000", "train_loss_mlm": "1.638", "train_loss_ita": "2.325", "train_loss_itm": "0.307", "epoch": 3}
{"train_lr": "0.000", "train_loss_mlm": "1.598", "train_loss_ita": "2.287", "train_loss_itm": "0.293", "epoch": 4}
{"train_lr": "0.000", "train_loss_mlm": "1.567", "train_loss_ita": "2.250", "train_loss_itm": "0.283", "epoch": 5}
{"train_lr": "0.000", "train_loss_mlm": "1.540", "train_loss_ita": "2.221", "train_loss_itm": "0.273", "epoch": 6}
{"train_lr": "0.000", "train_loss_mlm": "1.515", "train_loss_ita": "2.190", "train_loss_itm": "0.265", "epoch": 7}
{"train_lr": "0.000", "train_loss_mlm": "1.495", "train_loss_ita": "2.157", "train_loss_itm": "0.257", "epoch": 8}
{"train_lr": "0.000", "train_loss_mlm": "1.473", "train_loss_ita": "2.141", "train_loss_itm": "0.250", "epoch": 9}

At the 9th epoch of training, I wanted to test the performance of the current checkpoint. I found that the model performed poorly on image-text retrieval, reaching only 1% Recall@1.

Since I have no prior pre-training experience, I would like to ask whether this is normal, perhaps because my pre-training time is too short?

LiJunnan1992 commented 2 years ago

I think 9 epochs should give a decent result.

Fly2flies commented 2 years ago

> I think 9 epochs should give a decent result.

Thank you for your timely reply. I checked again and found that I had neglected the effect of the sampler's random shuffling in the DataLoader during evaluation. The image features in evaluation depend on the order in which the loader yields the data, while the text features do not, so the image and text IDs no longer match up.
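For anyone hitting the same issue, here is a minimal sketch of the pitfall, using toy stand-in "features" rather than the actual ALBEF evaluation code (all names here are illustrative). If image features are collected in a shuffled loader order while text features are collected in dataset order, every row of the similarity matrix points at the wrong text; iterating the evaluation loader sequentially (i.e. with shuffling disabled) restores the alignment:

```python
# Toy dataset: index i pairs image_i with its caption text_i.
dataset = [(f"image_{i}", f"text_{i}") for i in range(8)]

def encode(x):
    # Stand-in for a feature extractor: returns the paired index,
    # so a "match" is simply equality of indices.
    return int(x.split("_")[1])

# Text features are computed once, in dataset order.
text_feats = [encode(t) for _, t in dataset]

# Buggy evaluation: image features come from a loader whose sampler
# reorders the data (simulated here by a rotation), so row i of the
# similarity matrix no longer corresponds to text i.
reordered = dataset[1:] + dataset[:1]
img_feats_buggy = [encode(im) for im, _ in reordered]

# Correct evaluation: iterate sequentially (shuffle=False in the
# DataLoader), so image index i lines up with text index i.
img_feats_ok = [encode(im) for im, _ in dataset]

recall_buggy = sum(a == b for a, b in zip(img_feats_buggy, text_feats)) / len(dataset)
recall_ok = sum(a == b for a, b in zip(img_feats_ok, text_feats)) / len(dataset)
print(recall_buggy, recall_ok)  # → 0.0 1.0
```

In PyTorch this corresponds to building the evaluation DataLoader with `shuffle=False` (or a sequential, non-shuffling sampler), so that the i-th extracted image feature still indexes the i-th text.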

Now that the problem has been solved, thank you again for sharing such good work!