mlfoundations / open_clip

An open source implementation of CLIP.

Text Similarity not equal to 1 for identical inputs #913

Closed sugary199 closed 1 month ago

sugary199 commented 1 month ago

Problem Description

I am using the pre-trained ViT-H-14 model from open_clip to compare the similarity of identical texts, and I found that the similarity score is not 1 even for exactly the same inputs. Here is the code snippet I used for testing:

import torch
from open_clip import create_model_from_pretrained, get_tokenizer 

model, preprocess = create_model_from_pretrained('ViT-H-14', pretrained='/data/vjuicefs_ai_camera_llm/11170092/01share/DFN5B-CLIP-ViT-H-14-378/open_clip_pytorch_model.bin')
tokenizer = get_tokenizer('ViT-H-14')

def categorize_text(text, threshold=0.1):
    test_input = tokenizer([text])
    item_input = tokenizer([text])

    test_features = model.encode_text(test_input)
    item_feature = model.encode_text(item_input)

    similarities = (test_features @ item_feature.T)
    print(similarities)

# Call function
categorize_text("looking at camera")

output result:

tensor([[0.4930]], grad_fn=<MmBackward0>)

The similarity result is 0.4930, far below the expected 1. Is this expected behavior of the model, or is there an error in how I am using it? Or is it a precision problem (though the difference should not be that large)? I would appreciate any help you can provide. Thank you!

sugary199 commented 1 month ago

I found the reason: I forgot to normalize the features. Sorry for the bother!
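For reference, the dot product of two unnormalized feature vectors scales with their norms, so identical inputs do not give 1. L2-normalizing the features turns the dot product into cosine similarity. A minimal sketch, using random tensors in place of `model.encode_text` output (the shapes here are illustrative, not taken from the model):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for model.encode_text output: unnormalized feature vectors.
feats_a = torch.randn(1, 1024)
feats_b = feats_a.clone()  # identical "text" -> identical features

# Without normalization, the dot product equals the squared norm,
# an arbitrary magnitude that depends on the feature scale.
raw = (feats_a @ feats_b.T).item()

# L2-normalize so the dot product becomes cosine similarity.
a = F.normalize(feats_a, dim=-1)
b = F.normalize(feats_b, dim=-1)
sim = (a @ b.T).item()

print(raw)  # arbitrary value, not 1
print(sim)  # 1.0 up to floating-point error
```

In the original snippet, the same effect is achieved by normalizing `test_features` and `item_feature` before the matrix multiplication; recent open_clip versions also expose a `normalize` flag on `encode_text` for this purpose.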