rmokady / CLIP_prefix_caption

Simple image captioning model
MIT License

Conceptual Captions Training #23

Open · goel-shashank opened this issue 2 years ago

goel-shashank commented 2 years ago

I have trained the model (both MLP and GPT-2) on the CC3M dataset, but the loss doesn't decrease much (it stays around 3.0). What loss can I expect for a good model? How many epochs should I run it for? Also, is any specific hyperparameter tuning required for CC? I have a model trained for 5 epochs, but it generates a similar caption for every image. I tried overfitting a batch of 512 image-caption pairs and everything works out, so I don't think there is a logical issue with the pipeline. Please let me know.
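
(For context, the single-batch check mentioned above can be done with a short loop like the sketch below. It assumes a PyTorch model with a ClipCaptionModel-style forward(tokens, prefix, mask) and a train.py-style loss, so treat the exact names and signature as assumptions rather than the repo's definitive API.)

```python
# Minimal sketch of an overfit-one-batch sanity check (assumed API, not the repo's exact code):
# if the loss can be driven close to zero on one fixed batch, the data/loss wiring is fine
# and the problem is more likely the learning rate, the prefixes, or dataset scale.
import torch
from torch.nn import functional as F

def overfit_one_batch(model, tokens, mask, prefix, steps=500, lr=2e-5, device="cuda"):
    model.train().to(device)
    tokens, mask, prefix = tokens.to(device), mask.to(device), prefix.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        outputs = model(tokens, prefix, mask)                      # assumed forward signature
        logits = outputs.logits[:, model.prefix_length - 1: -1]    # drop the prefix positions
        loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                               tokens.flatten(), ignore_index=0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step}: loss {loss.item():.3f}")
```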

rmokady commented 2 years ago

Hi @goel-shashank, are you using our default parameters?

Did you try both the GPT-2 fine-tuning and the frozen GPT-2?

goel-shashank commented 2 years ago

Hi @rmokady, I tried the default parameters. Do you have the training logs for your run? One thing I'm certainly doing differently is that I use a separately trained CLIP model (an RN50 with ~20% ImageNet zero-shot accuracy) trained on CC3M, rather than OpenAI's pretrained weights. The prefixes are generated from this model. I don't think this should be causing these issues.
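
(A rough sketch of how a custom-trained RN50 CLIP could be plugged in for prefix extraction in place of the OpenAI weights; the checkpoint name and state-dict loading below are assumptions about such a setup, not part of this repo.)

```python
# Sketch: build prefixes from a custom CC3M-trained RN50 CLIP instead of OpenAI's weights.
# The checkpoint path and its parameter layout are assumptions about the custom setup.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device, jit=False)  # OpenAI RN50 architecture
state = torch.load("cc3m_rn50_clip.pt", map_location=device)     # hypothetical custom checkpoint
model.load_state_dict(state)                                     # assumes matching parameter names

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    prefix = model.encode_image(image)  # this embedding is what the captioning model sees
```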

rmokady commented 2 years ago

For COCO, where we train both the prefix and GPT-2, the loss got down to 1.47. Unfortunately, the logs for Conceptual Captions were left on an old server and I can't access them anymore. 5 epochs over 3M images is a lot when using the standard CLIP.

Anyway, outputting the same sentence for any prefix usually means there is a bug somewhere.
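
(One way to narrow down such a bug: check whether the CLIP prefixes are already near-identical, or whether they only collapse after the mapping network. The sketch below assumes a model object with a clip_project mapper, as in this repo's model code; treat the attribute name as an assumption.)

```python
# Sketch of a collapse check: near-1.0 off-diagonal cosine similarity means the
# representations are (almost) identical, which would explain identical captions.
import torch
import torch.nn.functional as F

def pairwise_cosine(x):
    x = F.normalize(x.float(), dim=-1)
    return x @ x.t()

def check_collapse(model, clip_embeddings):
    # clip_embeddings: a small batch of raw CLIP prefixes, e.g. loaded from the parsed .pkl
    with torch.no_grad():
        mapped = model.clip_project(clip_embeddings).view(len(clip_embeddings), -1)  # assumed attribute
    off = ~torch.eye(len(clip_embeddings), dtype=torch.bool)
    print("raw prefix mean off-diag cosine:", pairwise_cosine(clip_embeddings)[off].mean().item())
    print("mapped prefix mean off-diag cosine:", pairwise_cosine(mapped)[off].mean().item())
```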

goel-shashank commented 2 years ago

As I mentioned, I was able to fit a batch of 512 image-caption pairs and everything works out, so I don't think there is a logical issue with the pipeline. Still, I will check everything once more. Closing this issue! Please let me know if you find anything useful!

rmokady commented 2 years ago

Hi @goel-shashank, I found some logs for Conceptual Captions. This is with the ResNet CLIP: [screenshot: training log, 2022-02-19]

rmokady commented 2 years ago

This is with the ViT CLIP: [screenshot: training log, 2022-02-19]

ycchanau commented 2 years ago

I have the same problem with my own dataset. It keeps generating similar captions...

surisdi commented 2 years ago

Hi, I have the same problem for Conceptual Captions + frozen model. Do you have loss values for that scenario? All the inputs end up converging to the same prefix. Thanks!

I followed the README and ran:

python parse_conceptual.py --clip_model_type ViT-B/32 --data_root /path/to/conceptual_captions --num_threads 100

and then

python train.py --only_prefix --data /path/to/conceptual_captions/conceptual_clip_ViT-B_32_train.pkl --out_dir /path/to/output_dir --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40
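
(If anyone wants to rule out a preprocessing issue before training, a quick look at the parsed pickle can help. The sketch below assumes it stores a dict with a "clip_embedding" tensor and a parallel "captions" list; the exact keys may differ in your version of the parser.)

```python
# Hedged sanity check on the parsed Conceptual Captions pickle before running train.py.
# The key names ("clip_embedding", "captions") are assumptions about the parser output.
import pickle
import torch

with open("/path/to/conceptual_captions/conceptual_clip_ViT-B_32_train.pkl", "rb") as f:
    data = pickle.load(f)

emb = data["clip_embedding"].float()
print("embeddings:", tuple(emb.shape), "captions:", len(data["captions"]))

# Near-zero variance across samples would mean every prefix is already (almost) the
# same going into training, i.e. the collapse happens upstream of train.py.
print("per-dimension std (mean):", emb.std(dim=0).mean().item())
print("unique embeddings:", torch.unique(emb, dim=0).shape[0])
```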

mmderakhshani commented 1 year ago

@surisdi Did you manage to reproduce the results?