Given a prompt, the resulting embedding is slightly different when it is computed in a batch (batch_size > 1) than when it is computed as a single inference.
For example, computing the embedding of a prompt with
model.encode_text(clip.tokenize(prompt)) does not give the same result as computing the feature of the batched prompt, model.encode_text(clip.tokenize([prompt, prompt, ..., prompt])).
I would not expect any difference in output regardless of the batch size. Here is the code to reproduce this (I am also providing a Colab notebook). The same discrepancy is observed on both CPU and GPU.
import clip
import torch
device = 'cpu'
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
# We will use the same prompt for batch size 1 and batch size N
batch_size = 16
prompt = 'it is a sunny day today'
# Single text vs. batch text
text_no_batch = prompt
text_batch_1 = [prompt] * 1
text_batch_N = [prompt] * batch_size
# Compute features
feature_no_batch = model.encode_text(clip.tokenize(text_no_batch).to(device))
feature_batch_1 = model.encode_text(clip.tokenize(text_batch_1).to(device))
feature_batch_N = model.encode_text(clip.tokenize(text_batch_N).to(device))
assert feature_no_batch[0].shape == feature_batch_N[0].shape
# Check if inference is the same when batch size is 1
is_close = torch.allclose(feature_no_batch[0], feature_batch_1[0])
print(f"Are the samples identical for batch size 1: {is_close}")
# Check if inference is different when batch size != 1
is_close = torch.allclose(feature_no_batch[0], feature_batch_N[0])
print(f"Are the samples identical for batch size {batch_size}: {is_close}\n")
# Find the indices that differ and print the corresponding values
not_close_idx = torch.nonzero(~torch.isclose(feature_no_batch[0], feature_batch_N[0])).squeeze(1)
print("Failing indices:")
print(feature_no_batch[0][not_close_idx])
print(feature_batch_N[0][not_close_idx])
print(feature_no_batch[0][not_close_idx] == feature_batch_N[0][not_close_idx])
output:
Are the samples identical for batch size 1: True
Are the samples identical for batch size 16: False
Failing indices:
tensor([-9.9881e-05, 3.2202e-03, 6.7967e-03, 1.5653e-04],
grad_fn=<IndexBackward>)
tensor([-9.9864e-05, 3.2201e-03, 6.7965e-03, 1.5655e-04],
grad_fn=<IndexBackward>)
tensor([False, False, False, False])
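For reference, here is a small follow-up sketch (not part of the original script, and assuming the variables feature_no_batch and feature_batch_N from the code above are still in scope) that quantifies how large the discrepancy is. The default torch.allclose tolerances (rtol=1e-5, atol=1e-8) are fairly strict for float32, so a slightly looser tolerance may already report the two features as equal.
# Follow-up sketch: measure the size of the batch-size-dependent difference.
# Assumes feature_no_batch and feature_batch_N from the script above.
with torch.no_grad():
    diff = (feature_no_batch[0] - feature_batch_N[0]).abs()
    rel = diff / feature_no_batch[0].abs().clamp_min(1e-12)
    print(f"max abs diff: {diff.max().item():.3e}")
    print(f"max rel diff: {rel.max().item():.3e}")
    # With a slightly looser absolute tolerance the features may already compare equal.
    print(torch.allclose(feature_no_batch[0], feature_batch_N[0], atol=1e-6))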