Hello! Yes, you are correct. For this work, I just grabbed the natural-language labels used for ImageNet in the original CLIP work (they can be found here: https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb).
I agree that distinguishing between the classes that share a name would help improve accuracy. More generally, using different natural-language labels altogether may also help! WordNet provides synonyms for each synset id, so those could work for disambiguation; see the sketch below.
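If it's useful, here's a minimal sketch (assuming NLTK with the WordNet corpus downloaded) of mapping an ImageNet wnid to its synonym list. The two wnids are the "missile" pair discussed below and are only illustrative:

```python
# Minimal sketch: look up WordNet synonyms for an ImageNet synset id.
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def synonyms_for_wnid(wnid: str) -> list[str]:
    """Map an ImageNet wnid like 'n03773504' to its WordNet lemma names."""
    offset = int(wnid[1:])  # drop the leading 'n', keep the numeric offset
    synset = wn.synset_from_pos_and_offset('n', offset)
    return [lemma.replace('_', ' ') for lemma in synset.lemma_names()]

print(synonyms_for_wnid('n03773504'))  # e.g. ['missile']
print(synonyms_for_wnid('n04008634'))  # e.g. ['projectile', 'missile']
```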
I chose not to adjust these labels for the sake of a fair comparison to the baseline, but this is a good idea for improving accuracy in the future!
Thank you for the work, Sarah.
I've noticed that the prompts JSON file uses the class names as keys, but the class names are not unique. Specifically, there are two instances of "missile" (the rocket and projectile synsets) and two instances of "sunglasses" (the sunglass and shades synsets). The current setup gives both classes in each pair exactly the same prompts and text embeddings.
When we take the argmax at the end, we always pick the earlier of the two classes (based on class order) and get 0% accuracy on the later one, since we never predict it. Here's a toy illustration:
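The snippet below is a toy numpy demonstration, not the repo's actual inference code, of why identical text embeddings break the prediction: the two classes tie on every image, and argmax resolves ties by returning the first index.

```python
import numpy as np

text_emb = np.array([
    [0.1, 0.9],  # class 0: "missile" (rocket synset)
    [0.1, 0.9],  # class 1: "missile" (projectile synset) -- identical row
    [0.8, 0.2],  # class 2: some other class
])
image_emb = np.array([0.1, 0.9])  # an image that actually belongs to class 1

scores = text_emb @ image_emb  # similarity scores against each class prompt
print(scores)          # classes 0 and 1 tie exactly
print(scores.argmax()) # -> 0: the later duplicate can never win
```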
We could fix this by using synset ids as keys and adding some context to the prompt to disambiguate the duplicate class names, e.g.:
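A rough sketch of what the prompts file could look like; the glosses are paraphrased from WordNet, and the exact prompt template and file name are just placeholders:

```python
import json

# Key by wnid instead of class name, with a short gloss appended to
# disambiguate the duplicates.
prompts_by_wnid = {
    "n03773504": "a photo of a missile, a rocket weapon.",
    "n04008634": "a photo of a missile, a fired projectile.",
    "n04355933": "a photo of a sunglass, a convex burning lens.",
    "n04356056": "a photo of sunglasses, dark tinted glasses.",
}

with open("prompts.json", "w") as f:
    json.dump(prompts_by_wnid, f, indent=2)
```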