Closed Fly2flies closed 2 years ago
I think 9 epochs should give a decent result.
Thank you for the timely reply. I checked again and found that I had overlooked the random shuffling done by the DataLoader's sampler during evaluation. In "evaluation", the image features are stored in the order in which the loader yields the data, whereas the text features do not depend on that order, so the image ids and text ids no longer match.
Now that the problem has been solved, thank you again for sharing such good work!
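For anyone hitting the same symptom, the ordering mismatch described above can be reproduced with a small toy sketch. It is not ALBEF code: `feature`, `extract_image_features`, `realign` and `text_to_image_recall_at_1` are hypothetical helpers, and the "encoder" is just a unit vector derived from the dataset index. It only shows that shuffled extraction breaks the row-to-id correspondence, and that reordering by the recorded ids (or not shuffling at all) restores it.

```python
import math
import random


def feature(idx, n):
    # Hypothetical stand-in for an encoder: a unit vector unique to each id.
    angle = 2 * math.pi * idx / n
    return (math.cos(angle), math.sin(angle))


def extract_image_features(n, batch_size, shuffle, seed=0):
    # Mimics a loader: features are appended in the order the batches arrive.
    order = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(order)
    feats, ids = [], []
    for start in range(0, n, batch_size):
        batch = order[start:start + batch_size]
        feats.extend(feature(i, n) for i in batch)
        ids.extend(batch)  # remember which dataset id each row belongs to
    return feats, ids


def realign(feats, ids):
    # Put every feature back at the row given by its dataset id.
    aligned = [None] * len(feats)
    for f, i in zip(feats, ids):
        aligned[i] = f
    return aligned


def text_to_image_recall_at_1(text_feats, image_feats):
    # Assumes row j of image_feats is the image with id j (the evaluation's assumption).
    hits = 0
    for t_id, t in enumerate(text_feats):
        sims = [t[0] * v[0] + t[1] * v[1] for v in image_feats]
        best_row = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best_row == t_id)
    return hits / len(text_feats)


N = 100
text_feats = [feature(i, N) for i in range(N)]  # text side is always in id order
ordered_feats, _ = extract_image_features(N, 32, shuffle=False)
shuffled_feats, shuffled_ids = extract_image_features(N, 32, shuffle=True)

print("no shuffle:        ", text_to_image_recall_at_1(text_feats, ordered_feats))
print("shuffled:          ", text_to_image_recall_at_1(text_feats, shuffled_feats))
print("shuffled, realigned:",
      text_to_image_recall_at_1(text_feats, realign(shuffled_feats, shuffled_ids)))
```

In a PyTorch setup the simplest equivalent is usually to build the evaluation `DataLoader` with `shuffle=False` (or a sequential sampler), so that row order and dataset ids agree without any realignment.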
Hi, thanks for sharing such a great codebase. I tried ALBEF on my own large-scale (4M) noisy image-text pair dataset, with the only change being an additional unimodal contrastive loss. During pre-training, every loss value decreased smoothly (4 RTX 3090 GPUs, batch size 32).
At the 9th epoch of training, I tested the current checkpoint and found that it performs poorly on image-text retrieval: only 1% Recall@1.
Since I have no prior pre-training experience, I would like to ask whether this is normal, i.e. whether my pre-training time is simply too short.