Hello, I couldn't find any explicit information about the backbone architecture used for CLIP in the papers. I'm not sure whether it is based on ViT or ResNet, or which specific model it is. I would greatly appreciate any details you could provide about the model. Thank you very much for your assistance!