Open keshunhu opened 2 months ago
Unfortunately, as this project was developed for Shade Inc. (https://shade.inc/), all rights to the model weights are retained by the company. If you have any questions about setting up the data and infrastructure, though, I'd be happy to help with that.
Thanks for your reply, bro. Since I only want to work on the caption generation task, I’m wondering if training solely on the MS COCO dataset would be sufficient. Do you have any recommendations? Also, could you provide an estimate of how many epochs are generally needed for the model to converge? Thanks again!
I experimented with pretraining on a smaller 500K dataset and found that the model generalizes poorly: it essentially overfits the data because jointly training two transformer models is complex. In addition, the contrastive loss and image-text matching loss require a decently sized queue of images, which in turn requires more data for good convergence, so a larger mini-batch size (say > 32) helps as well. I would therefore recommend striving for at least 1M to 2M images at a minimum. As for epochs, I trained on 5M images for 20 epochs and found that it yielded solid performance. If you train on only 1M images you may only need 10 epochs, and you can then finetune the model further on Flickr30K or TextCaps.
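To illustrate why the queue size matters, here is a minimal NumPy sketch (not the actual training code; all names and shapes are illustrative) of an InfoNCE-style image-text contrastive loss with a queue of negatives from past batches. Each image is scored against its paired text plus every text in the queue, so a larger batch and a larger queue mean more negatives per step, which is why more data helps convergence.

```python
import numpy as np

def l2norm(x):
    """L2-normalize embeddings row-wise."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss_with_queue(img_emb, txt_emb, queue, temperature=0.07):
    """Image-to-text InfoNCE loss (hypothetical sketch).

    img_emb: (B, D) normalized image embeddings
    txt_emb: (B, D) normalized text embeddings (positives, row-aligned)
    queue:   (K, D) normalized text embeddings from past batches (negatives)
    """
    B = img_emb.shape[0]
    # Similarities to in-batch texts (diagonal = positives) and to the queue.
    logits = np.concatenate([img_emb @ txt_emb.T, img_emb @ queue.T], axis=1)
    logits /= temperature
    # Softmax cross-entropy with the positive for row i at column i.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(B), np.arange(B)].mean()

rng = np.random.default_rng(0)
B, D, K = 32, 256, 4096                      # batch size, embed dim, queue size
img = l2norm(rng.standard_normal((B, D)))
txt = l2norm(rng.standard_normal((B, D)))
queue = l2norm(rng.standard_normal((K, D)))
loss = contrastive_loss_with_queue(img, txt, queue)
```

In a real setup the queue is updated FIFO with embeddings from each mini-batch (often from a momentum encoder), which is how you get thousands of negatives without a huge batch.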
Thanks, bro. I really appreciate your guidance. I'll follow your advice and try training on a dataset of 1-2 million images. Looking forward to the results after training!
Brother, I don’t have enough computational resources. Could you share the weights you’ve pre-trained?