Training maple on multi-class (image-caption) dataset

AhmedBourouis commented 1 year ago

Thank you for this great work and the clear, detailed implementation. I was wondering whether it is possible to train MaPLe on "scene" images such as those in MS-COCO.

1- What would be appropriate preprocessing steps? I noticed that every train/test dataset you worked with has one folder per class. That layout cannot be reproduced for multi-class images.

2- What changes can be made in the code to adapt the model to multi-class classification during training?

Thank you again for this amazing contribution!

muzairkhattak commented 1 year ago

Hi @AhmedBourouis,

Thank you for showing interest in our work.

Yes, it is possible to train MaPLe on an image-caption pair dataset such as COCO-Captions.

1- What would be appropriate preprocessing steps? I noticed that every train/test dataset you worked with has one folder per class. That layout cannot be reproduced for multi-class images.

You do not need to arrange the folders manually. Instead, you would mainly need to implement a custom data loader that returns image-caption pairs, which you can then use to train MaPLe further. You can refer to this great tutorial on building a PyTorch data loader that provides image-text pairs for training a CLIP-like model.

As MaPLe is based on the Dassl library, you will need to dig a bit into Dassl itself, as well as into this part of the code, where you will need to implement your custom loader.
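
As a starting point for such a loader, here is a minimal sketch of the pairing step only. It assumes the standard COCO-Captions annotation layout (an "images" list with id/file_name and an "annotations" list with image_id/caption). The class name `CaptionPairs`, the `image_root` argument, and the returned dictionary keys are hypothetical and are not part of MaPLe or Dassl; image loading, the CLIP transform, and tokenization are left out.

```python
import json
import os


class CaptionPairs:
    """Map-style collection of (image_path, caption) pairs.

    It defines __len__ and __getitem__, which is the interface that
    torch.utils.data.DataLoader expects from a map-style dataset.
    """

    def __init__(self, annotations, image_root):
        # COCO-Captions layout: "images" maps id -> file_name, and each
        # entry in "annotations" holds one caption for one image_id.
        id_to_file = {img["id"]: img["file_name"] for img in annotations["images"]}
        self.pairs = [
            (os.path.join(image_root, id_to_file[ann["image_id"]]),
             ann["caption"].strip())
            for ann in annotations["annotations"]
        ]

    @classmethod
    def from_file(cls, annotation_path, image_root):
        # Convenience constructor for an annotation JSON file on disk.
        with open(annotation_path, "r", encoding="utf-8") as handle:
            return cls(json.load(handle), image_root)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        # A real loader would open the image and tokenize the caption here;
        # this sketch only returns the path and the raw caption text.
        image_path, caption = self.pairs[index]
        return {"image_path": image_path, "caption": caption}


# Tiny in-memory example (made-up file names and captions).
example = {
    "images": [
        {"id": 1, "file_name": "000000000001.jpg"},
        {"id": 2, "file_name": "000000000002.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "A dog runs on the beach. "},
        {"image_id": 1, "id": 11, "caption": "A brown dog near the sea."},
        {"image_id": 2, "id": 12, "caption": "Two people riding bikes."},
    ],
}
dataset = CaptionPairs(example, image_root="train2017")
print(len(dataset), dataset[0])
```

Note that an image with several captions yields several pairs, so no per-class folder structure is needed.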

2- What changes can be made in the code to adapt the model to multi-class classification during training?

To classify a given image into multiple classes, you can do one of the following:

Implement multiple heads, with each head representing a single class/task, e.g. via a linear layer. You would then apply a sigmoid to each head's output and train each head with binary cross-entropy. (You can look at sample reference code for multi-label classification with multiple heads at this link.)
Or, more simply, apply cross-entropy on top of the cosine similarity between the image and text embeddings, and maximize the image-text similarity of all positive classes at once.
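
To make the two options concrete, here is a framework-free sketch of the loss arithmetic for a single image. It is only one possible reading of the suggestions above, not MaPLe's implementation. The function names are hypothetical, `scale=100.0` is an assumed temperature, and the second loss interprets "all positive classes at once" as averaging the softmax negative log-probability over the positive classes. In PyTorch, the first option corresponds to `torch.nn.functional.binary_cross_entropy_with_logits`.

```python
import math


def cosine_logits(image_emb, text_embs, scale=100.0):
    """Scaled cosine similarity between one image embedding and each
    class text embedding. `scale` is an assumed temperature value."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    image_norm = norm(image_emb)
    return [
        scale * sum(a * b for a, b in zip(image_emb, text)) / (image_norm * norm(text))
        for text in text_embs
    ]


def bce_with_logits(logits, targets):
    """Option 1: an independent sigmoid + binary cross-entropy per class,
    averaged over classes."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def multi_positive_cross_entropy(logits, targets):
    """Option 2 (one interpretation): softmax cross-entropy where the target
    mass is spread uniformly over the positive classes.
    Assumes at least one positive class."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    positives = [z for z, y in zip(logits, targets) if y == 1]
    # Mean negative log-probability of the positive classes.
    return -sum(z - log_norm for z in positives) / len(positives)


# Toy example: three classes, the image is positive for classes 0 and 2.
image = [1.0, 0.0]
class_texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1, 0, 1]

logits = cosine_logits(image, class_texts)
print(bce_with_logits(logits, targets))
print(multi_positive_cross_entropy(logits, targets))
```

The sigmoid variant scores each class independently, so any number of labels can be active. The softmax variant makes the classes compete, which can penalize an image for having several confident positives.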

Kindly let me know if that helps to solve your issue.

Thank you and kind regards.

AhmedBourouis commented 1 year ago

Thank you for the clear and detailed answer! You fully answered my question so I'm closing this now.

vrk7 commented 1 year ago

@AhmedBourouis Could you please share the training notebook if you have it handy?

muzairkhattak / multimodal-prompt-learning

Training maple on multi-class (image-caption) dataset #9