`text = clip.tokenize(classes).to(device)`

Change this line to something like:

`text = torch.cat([clip.tokenize(f"a photo of a {c}") for c in classes]).to(device)`
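For anyone landing here, this is roughly how the prompted tokens get turned into zero-shot classification weights. It's only a sketch; the class names, the template, and the model choice are placeholder assumptions, not the exact setup behind the paper's numbers:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = [str(i) for i in range(10)]  # e.g. MNIST digit names
text = torch.cat(
    [clip.tokenize(f"a photo of the number {c}") for c in classes]
).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)          # (10, 512) for ViT-B/32
    text_features /= text_features.norm(dim=-1, keepdim=True)

# With image_features normalized the same way:
# logits = image_features @ text_features.T
# preds  = logits.argmax(dim=-1)
```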
The results in the paper also depend on how the data preprocessing was set up; I'm not sure exactly how they did it.
I tried that and got better results, but they were still nowhere near 88%.
Performance varies a lot depending on the prompt template and the model you choose. For instance, the top-1 accuracies (%) on MNIST are:

| Template | RN50 | RN101 | RN50x4 | ViT-B/32 | ViT-B/16 |
| --- | --- | --- | --- | --- | --- |
| `"a photo of the number {}"` | 51.28 | 44.34 | 56.64 | 40.11 | 53.95 |
| the 80 ImageNet templates | 26.5 | 36.14 | 59.53 | 30.16 | 44.22 |
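For what it's worth, template ensembling is usually done by encoding every prompt for a class, averaging the normalized embeddings, and re-normalizing. A rough sketch of that idea; the three templates below are just stand-ins, not the actual 80 ImageNet prompts mentioned above:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Stand-in templates; swap in whichever prompt list you want to ensemble over.
templates = [
    "a photo of the number {}.",
    "a blurry photo of the number {}.",
    "a low resolution photo of the number {}.",
]
classes = [str(i) for i in range(10)]

with torch.no_grad():
    zeroshot_weights = []
    for c in classes:
        texts = clip.tokenize([t.format(c) for t in templates]).to(device)
        embeddings = model.encode_text(texts)
        embeddings /= embeddings.norm(dim=-1, keepdim=True)
        # Average the per-template embeddings, then re-normalize.
        class_embedding = embeddings.mean(dim=0)
        class_embedding /= class_embedding.norm()
        zeroshot_weights.append(class_embedding)
    zeroshot_weights = torch.stack(zeroshot_weights, dim=1)  # (dim, num_classes)

# Later: logits = image_features @ zeroshot_weights
```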
According to the paper, they were able to achieve 80% accuracy on MNIST. However, I tested MNIST on CLIP and got only 10-20%. Does anyone know what I am doing wrong here?
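Hard to say without seeing your evaluation code, but one thing worth double-checking is that the MNIST images actually go through the `preprocess` transform returned by `clip.load` (it resizes, converts the grayscale digits to RGB, and normalizes them), and that both image and text features are normalized before taking the argmax. A rough sketch of such an evaluation loop, under those assumptions:

```python
import torch
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# preprocess resizes, converts grayscale to RGB, and normalizes with CLIP's stats.
test_set = MNIST(root="./data", train=False, download=True, transform=preprocess)
loader = DataLoader(test_set, batch_size=256)

classes = [str(i) for i in range(10)]
text = torch.cat(
    [clip.tokenize(f"a photo of the number {c}") for c in classes]
).to(device)

correct = 0
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    for images, labels in loader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()

print(f"top-1 accuracy: {correct / len(test_set):.4f}")
```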