openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

CLIP's capability of detecting scene or background information #389

Open · Seeker98 opened this issue 12 months ago

Seeker98 commented 12 months ago

How does CLIP perform at detecting global image information? For example, telling whether an image is noise-corrupted, downsampled, or hazy, and further, picking the right corruption parameters, such as the noise standard deviation? I tried images with different types of noise (Gaussian, Poisson, gamma) and other corruptions (downsampling, haze), with prompt sets like [gaussian noise with std=25, gaussian noise with std=50] and [noisy, hazy], but the inference results are poor. Am I missing any key step in my testing? A rough sketch of the kind of test I ran is below.
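
For reference, a minimal sketch of the test, following the zero-shot usage pattern from this repo's README (the image path, model name, and exact prompt strings here are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate corruption prompts (illustrative values)
image = preprocess(Image.open("noisy_example.png")).unsqueeze(0).to(device)
prompts = ["gaussian noise with std=25", "gaussian noise with std=50", "hazy", "clean"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    # logits_per_image: similarity of the image to each candidate prompt
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for prompt, prob in zip(prompts, probs[0]):
    print(f"{prompt}: {prob:.3f}")
```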

mgupta70 commented 7 months ago

A few points:

- CLIP is not a text-generation model. You supply candidate text prompts as input; CLIP embeds them along with the image, and what you get back is a cosine similarity score between the image and each prompt.
- CLIP is not good at fine-grained classification (as mentioned in the paper), and distinguishing noise with std=25 from std=50 is a very fine-grained distinction.
- CLIP is trained on internet data, which skews toward everyday objects. If an image of a car on the internet is noisy or hazy, there is a high chance the accompanying text still says 'car' and mentions nothing about the noise.

A concrete sketch of the similarity computation follows below.
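
To make the first point concrete, here is a minimal sketch of the cosine-similarity computation using this repo's encode_image / encode_text methods (the image path and prompts are illustrative):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("car.png")).unsqueeze(0).to(device)  # placeholder path
text = clip.tokenize(["a photo of a car", "a noisy image"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity = dot product of L2-normalized embeddings
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(0)
print(similarity)  # likely much higher for 'a photo of a car' than for 'a noisy image'
```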