Open goel-shashank opened 2 years ago

I have trained the model (both the MLP mapping network and GPT-2) on the CC3M dataset, but the loss doesn't decrease much (it stays around 3.0). What loss can I expect for a good model, and how many epochs should I run? Is any specific hyperparameter tuning required for Conceptual Captions? A model trained for 5 epochs generates a similar caption for every image. I tried fitting on a single batch of 512 image-caption pairs and everything works out, so I don't think there is any logical issue with the pipeline. Please let me know.
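For anyone who wants to run the same kind of check, here is a minimal, self-contained sketch of overfitting one small fixed batch. It rebuilds the prefix-to-GPT-2 setup with a plain MLP instead of using the repo's train.py, and it assumes the pickle layout written by the parsing scripts (a "clip_embedding" tensor plus a "captions" list whose entries hold a "caption" string and a "clip_embedding" index); the file path and batch size are placeholders. If the pipeline is sound, the loss on this fixed batch should fall well below the ~3.0 plateau:

# Sketch: overfit one small batch as a pipeline sanity check.
# Assumes the pickle layout of the repo's parsing scripts; path is a placeholder.
import pickle
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
prefix_length = 10
n = 32  # small batch for illustration; the issue above used 512

with open("conceptual_clip_ViT-B_32_train.pkl", "rb") as f:  # placeholder path
    data = pickle.load(f)

# Assumed keys: each caption entry stores its text and an index into the embedding tensor.
idx = [int(data["captions"][i]["clip_embedding"]) for i in range(n)]
prefix = data["clip_embedding"][idx].float().to(device)   # [n, clip_dim]
texts = [data["captions"][i]["caption"] for i in range(n)]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tok = tokenizer(texts, return_tensors="pt", padding=True)
tokens = tok["input_ids"].to(device)
mask = tok["attention_mask"].to(device).float()

gpt = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
embed_dim = gpt.transformer.wte.weight.shape[1]

# Simple MLP mapping the CLIP embedding to prefix_length GPT-2 token embeddings.
mapper = torch.nn.Sequential(
    torch.nn.Linear(prefix.shape[1], embed_dim * prefix_length // 2),
    torch.nn.Tanh(),
    torch.nn.Linear(embed_dim * prefix_length // 2, embed_dim * prefix_length),
).to(device)

opt = torch.optim.AdamW(list(mapper.parameters()) + list(gpt.parameters()), lr=2e-5)

for step in range(300):
    opt.zero_grad()
    prefix_embed = mapper(prefix).view(n, prefix_length, embed_dim)
    token_embed = gpt.transformer.wte(tokens)
    inputs = torch.cat((prefix_embed, token_embed), dim=1)
    full_mask = torch.cat((torch.ones(n, prefix_length, device=device), mask), dim=1)
    logits = gpt(inputs_embeds=inputs, attention_mask=full_mask).logits
    # Predict caption tokens from the positions that precede them.
    logits = logits[:, prefix_length - 1: -1]
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), tokens.reshape(-1),
                           ignore_index=tokenizer.pad_token_id)
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(step, loss.item())  # should keep dropping on a fixed batch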
Hi @goel-shashank, Are you using our default parameters?
Did you try both the GPT-2 fine-tuning and the frozen GPT-2?
Hi @rmokady, I tried the default parameters. Do you have the training logs from your run? One thing I am certainly doing differently: I trained a separate CLIP model (RN50, 20% ImageNet zero-shot accuracy) on CC3M rather than using OpenAI's pretrained weights, and the prefixes are generated from that model. I don't think this should be causing these issues, though.
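For anyone trying the same thing, here is a rough sketch of how prefixes from a custom CLIP checkpoint could be dumped into the pickle layout the training script reads. The checkpoint path, the assumption that it is a plain state_dict, the (image, caption) list, and the output keys are all placeholders/assumptions; adjust them to your setup:

# Sketch: extract prefixes with a custom RN50 CLIP and save them in the assumed
# {"clip_embedding": Tensor[N, D], "captions": [{"caption": str, "clip_embedding": int}, ...]} layout.
import pickle
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device, jit=False)
state = torch.load("my_cc3m_rn50.pt", map_location="cpu")  # hypothetical checkpoint path
model.load_state_dict(state)                               # assumes a plain state_dict
model.eval()

pairs = [("images/0001.jpg", "a dog running on the beach")]  # placeholder (image, caption) pairs
embeddings, captions = [], []
with torch.no_grad():
    for i, (path, caption) in enumerate(pairs):
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        emb = model.encode_image(image).cpu().float()         # [1, D]; D = 1024 for RN50
        embeddings.append(emb)
        captions.append({"caption": caption, "clip_embedding": i})

with open("conceptual_clip_custom_RN50_train.pkl", "wb") as f:
    pickle.dump({"clip_embedding": torch.cat(embeddings, dim=0), "captions": captions}, f)

One thing worth checking with this setup: RN50 image embeddings are 1024-dimensional, while the commands in this thread use ViT-B/32, whose embeddings are 512-dimensional, so the mapping network's input size has to be changed to match.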
For COCO, where we train both the prefix and GPT-2, the loss got down to 1.47. Unfortunately, the logs for Conceptual Captions were left on an old server and I can't access them anymore. Also, 5 epochs over 3M images is a lot when using the standard CLIP.
Anyway, outputting the same sentence for every prefix usually means there is a bug somewhere.
As I mentioned, I was able to overfit a batch of 512 image-caption pairs and everything worked out, so I don't think there is any logical issue with the pipeline. Still, I will go over everything once more. Closing this issue! Please let me know if you find anything useful!
Hi @goel-shashank, I found some logs for Conceptual Captions. This is with the ResNet CLIP:
This is with the ViT CLIP:
I have the same problem with my own dataset. It keeps generating similar captions...
Hi, I have the same problem with Conceptual Captions + the frozen GPT-2 setup. Do you have loss values for that scenario? All the inputs end up converging to the same prefix. Thanks!
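One way to confirm that kind of collapse is to measure the pairwise similarity of the mapped prefixes directly. Below is a small, self-contained sketch; how you obtain the prefixes depends on your model class (the model.clip_project call mentioned in the comment is an assumption based on the training code and may need renaming):

# Sketch: check whether different images really map to (nearly) the same prefix.
# `prefixes` should be the trained mapping network's output for a handful of different
# images, shaped [n, prefix_length, gpt_embed_dim], e.g. (assumption):
#   prefixes = model.clip_project(clip_embeddings).view(n, prefix_length, -1)
import torch
import torch.nn.functional as F

def prefix_collapse_report(prefixes: torch.Tensor) -> None:
    n = prefixes.shape[0]
    flat = F.normalize(prefixes.reshape(n, -1), dim=-1)   # one unit vector per image
    sim = flat @ flat.t()                                  # pairwise cosine similarity
    off_diag = sim[~torch.eye(n, dtype=torch.bool, device=sim.device)]
    print(f"mean off-diagonal cosine similarity: {off_diag.mean().item():.3f}")
    print(f"min / max: {off_diag.min().item():.3f} / {off_diag.max().item():.3f}")
    # Values close to 1.0 for all pairs mean the mapping network has collapsed and
    # GPT-2 is effectively decoding the same prefix every time.

# Example with random inputs, just to show the expected shapes:
prefix_collapse_report(torch.randn(8, 40, 768))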
I followed the README and ran:
python parse_conceptual.py --clip_model_type ViT-B/32 --data_root /path/to/conceptual_captions --num_threads 100
and then
python train.py --only_prefix --data /path/to/conceptual_captions/conceptual_clip_ViT-B_32_train.pkl --out_dir /path/to/output_dir --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40
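Before training, it may also be worth sanity-checking the pickle that the parsing step writes. A small sketch, assuming the same layout as the COCO parsing output (a "clip_embedding" tensor plus a "captions" list of dicts); adjust the path and keys if your version stores things differently:

# Sketch: quick sanity check of the parsed Conceptual Captions pickle.
# Path and key names are assumptions.
import pickle
import torch

with open("/path/to/conceptual_captions/conceptual_clip_ViT-B_32_train.pkl", "rb") as f:
    data = pickle.load(f)

emb = data["clip_embedding"].float()   # expected shape: [num_pairs, 512] for ViT-B/32
caps = data["captions"]
print("embeddings:", tuple(emb.shape), "captions:", len(caps))

# Indices stored in the caption entries should stay inside the embedding tensor.
max_idx = max(int(c["clip_embedding"]) for c in caps)
assert max_idx < emb.shape[0], "caption index points outside the embedding tensor"

# If the per-dimension spread across images is ~0, every image got (almost) the same
# embedding, and the captioner will collapse to one sentence no matter what it sees.
print("mean per-dimension std across images:", emb.std(dim=0).mean().item())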
@surisdi Did you manage to reproduce the results?