Open keshunhu opened 2 months ago
Unfortunately, as this project was developed for Shade Inc. (https://shade.inc/), all rights to the model weights are retained by the company. If you have any questions about setting up the data and infrastructure, though, I'd be happy to help with that.
Thanks for your reply, bro. Since I only want to work on the caption generation task, I’m wondering if training solely on the MS COCO dataset would be sufficient. Do you have any recommendations? Also, could you provide an estimate of how many epochs are generally needed for the model to converge? Thanks again!
I experimented with pretraining on a smaller 500K dataset and found that the model generalizes poorly: it essentially overfits the data because jointly training two transformer models is complex. In addition, the contrastive loss and image-text matching loss require a decently sized queue of images, which in turn requires more data for good convergence, so a larger mini-batch size (say > 32) helps as well. I would therefore recommend striving for at least 1M to 2M images at a minimum. As for epochs, I trained on 5M images for 20 epochs and found that it yielded solid performance. If you train on only 1M images you may only need 10 epochs, and you can then finetune the model further on Flickr30K or TextCaps.
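To illustrate why the queue size matters, here is a minimal NumPy sketch (not the actual training code; all names and shapes are illustrative) of an InfoNCE-style image-text contrastive loss with a queue of negatives from past batches. Each image is scored against its paired text plus every text in the queue, so a larger batch and a larger queue mean more negatives per step, which is why more data helps convergence.

```python
import numpy as np

def l2norm(x):
    """L2-normalize embeddings row-wise."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss_with_queue(img_emb, txt_emb, queue, temperature=0.07):
    """Image-to-text InfoNCE loss (hypothetical sketch).

    img_emb: (B, D) normalized image embeddings
    txt_emb: (B, D) normalized text embeddings (positives, row-aligned)
    queue:   (K, D) normalized text embeddings from past batches (negatives)
    """
    B = img_emb.shape[0]
    # Similarities to in-batch texts (diagonal = positives) and to the queue.
    logits = np.concatenate([img_emb @ txt_emb.T, img_emb @ queue.T], axis=1)
    logits /= temperature
    # Softmax cross-entropy with the positive for row i at column i.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(B), np.arange(B)].mean()

rng = np.random.default_rng(0)
B, D, K = 32, 256, 4096                      # batch size, embed dim, queue size
img = l2norm(rng.standard_normal((B, D)))
txt = l2norm(rng.standard_normal((B, D)))
queue = l2norm(rng.standard_normal((K, D)))
loss = contrastive_loss_with_queue(img, txt, queue)
```

In a real setup the queue is updated FIFO with embeddings from each mini-batch (often from a momentum encoder), which is how you get thousands of negatives without a huge batch.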
Thanks, bro. I really appreciate your guidance. I'll follow your advice and try training on a dataset of 1-2 million images. Looking forward to the results after training!
Brother, I don’t have enough computational resources. Could you share the weights you’ve pre-trained?