openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

encode_text gives different clip features for the same text, single batch vs multiple batch #429

Open KevinNWalker opened 6 months ago

KevinNWalker commented 6 months ago

I seem to get different results from encode_text when the same text is passed as a single-item batch versus as part of a larger batch.

See the code sample below:

import clip
import numpy

device = 'cuda'
clip_model, _ = clip.load('ViT-B/32', device)

clip_model.eval()

# Process 'I am happy' as a single batch
text_1 = clip.tokenize('I am happy', truncate=True).to(device)
feature_1 = clip_model.encode_text(text_1)
feature_1_np = feature_1.detach().cpu().numpy()
text_1_np = text_1.detach().cpu().numpy()

# Process 'I am happy' with a second batch
text_2 = clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device)
feature_2 = clip_model.encode_text(text_2)
feature_2_np = feature_2.detach().cpu().numpy()[0]
text_2_np = text_2.detach().cpu().numpy()[0]

print(f'Max diff in tokens {numpy.abs(text_2_np-text_1_np).max()}')
print(f'Max diff in features {numpy.abs(feature_2_np-feature_1_np).max()}')

When I run this I get the following results:

Max diff in tokens 0
Max diff in features 0.000732421875

Is this to be expected or am I using the code incorrectly?

Many thanks

bonjour-npy commented 6 months ago

Hi there👋

I think your code is correct, and so is the result it produces.

To the best of my knowledge (I can't be certain it's 100% right), tokenization in the CLIP model is just a lookup into a fixed vocabulary: the same input text always maps to the same token IDs. That's why text_1, text_2[0] and text_2[1] are exactly the same.
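You can check this part without running the model at all. A minimal sketch, using only clip.tokenize and torch.equal:

import torch
import clip

# clip.tokenize is a deterministic lookup into a fixed BPE vocabulary, so the
# token IDs for a given string do not depend on what else is in the batch.
single = clip.tokenize('I am happy', truncate=True)                 # shape [1, 77]
batched = clip.tokenize(['I am happy', 'I am sad'], truncate=True)  # shape [2, 77]
print(torch.equal(single[0], batched[0]))  # True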

But encode_text is a different story: the tokens are first mapped through an nn.Embedding layer and then run through the transformer (in fp16 when the model is loaded on CUDA), and that output may be affected by the batch size, the context of the input, or something else along those lines.

Here's a simple test:

import torch
import numpy
import clip

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _ = clip.load('ViT-B/32', device)

clip_model.eval()

# Batch of 1
text_1 = clip.tokenize('I am happy', truncate=True).to(device)
feature_1 = clip_model.encode_text(text_1)

# Batch of 2, same text
text_2 = clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device)
feature_2 = clip_model.encode_text(text_2)

# Batch of 3, same text
text_3 = clip.tokenize(['I am happy', 'I am happy', 'I am happy'], truncate=True).to(device)
feature_3 = clip_model.encode_text(text_3)

# Batch of 3, different texts (same batch size as text_3, different context)
text_4 = clip.tokenize(['I am happy', 'I am sad', 'I am angry'], truncate=True).to(device)
feature_4 = clip_model.encode_text(text_4)

print((text_1 - text_2[0]).sum(), '\n', (text_1[0] - text_3[0]).sum(), '\n', (text_1[0] - text_4[0]).sum())
print((feature_1 - feature_2[0]).sum(), '\n', (feature_1[0] - feature_3[0]).sum(), '\n', (feature_1[0] - feature_4[0]).sum())

And my output is shown below:

tensor(0, device='cuda:0') 
 tensor(0, device='cuda:0') 
 tensor(0, device='cuda:0')
tensor(0.0089, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>) 
 tensor(0.0180, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>) 
 tensor(0.0180, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>)

From the results for text_3 and text_4 we can draw a conclusion (admittedly not a rigorous one): it isn't the surrounding context that changes the output of encode_text, it's the batch size.
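If you want to probe where the drift comes from, here's a rough sketch, under the assumption (not verified here) that the half-precision weights clip.load uses on CUDA are what makes the batched computation wobble: cast the model to float32 and repeat the comparison. If the max difference drops to essentially zero, the batch-size effect is just floating-point rounding in the batched operations.

import torch
import clip

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _ = clip.load('ViT-B/32', device)
clip_model.float()  # clip.load gives fp16 weights on CUDA; cast them to float32
clip_model.eval()

with torch.no_grad():
    f1 = clip_model.encode_text(clip.tokenize('I am happy', truncate=True).to(device))
    f2 = clip_model.encode_text(clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device))

# Expected to be much closer to zero than the fp16 run, possibly exactly zero
print((f1[0] - f2[0]).abs().max())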

If you'd like to further communicate, feel free to reach out to me at nipeiyang@163.com.