songpipi / EPAN

Code of Emotion-Prior Awareness Network for Emotional Video Captioning

What is the underlying backbone architecture used in the CLIP model? #1

Open randomx207 opened 8 months ago

randomx207 commented 8 months ago

Hello, I couldn't find any explicit information about the backbone architecture used for CLIP in the paper. I'm uncertain whether it is based on ViT or ResNet, and which specific variant is used. I would greatly appreciate any relevant details about the model. Thank you very much for your assistance!

songpipi commented 5 months ago

Thank you for your interest in this work! We use the Vision Transformer (ViT) backbone, specifically the ViT-B/32 variant.
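For anyone trying to reproduce the feature extraction, here is a minimal sketch of encoding video frames with CLIP ViT-B/32. It assumes OpenAI's `clip` package (`pip install git+https://github.com/openai/CLIP.git`) and pre-decoded PIL frames; the batching and normalization shown are illustrative, not necessarily the repository's exact pipeline.

```python
# Minimal sketch: extract frame features with CLIP ViT-B/32.
# Assumption: OpenAI's `clip` package; frames are already decoded as PIL images.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_frames(frames):
    """Encode a list of PIL images into CLIP visual features of shape (N, 512)."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)                  # (N, 512) for ViT-B/32
        feats = feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize per frame
    return feats

# Usage example with a single dummy frame:
feats = encode_frames([Image.new("RGB", (224, 224))])
print(feats.shape)  # torch.Size([1, 512])
```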

cxlshmily commented 2 weeks ago

I am also trying to replicate this code. Have you succeeded in doing so?