Given a prompt, the resulting embedding is slightly different when it is computed in a batch (batch_size > 1) than when it is computed as a single inference.
For example, computing the embedding of a prompt with
model.encode_text(clip.tokenize(prompt)) does not give the same result as computing the feature of the batched prompt, model.encode_text(clip.tokenize([prompt, prompt, ..., prompt])).
I would not expect any difference in output regardless of the batch size. Here is the code to reproduce this (I am also providing a Colab notebook). The same discrepancy is observed on both CPU and GPU.
import clip
import torch
device = 'cpu'
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
# We will use the same prompt for batch size 1 and batch size N
batch_size = 16
prompt = 'it is a sunny day today'
# Single text vs. batch text
text_no_batch = prompt
text_batch_1 = [prompt] * 1
text_batch_N = [prompt] * batch_size
# Compute features
feature_no_batch = model.encode_text(clip.tokenize(text_no_batch).to(device))
feature_batch_1 = model.encode_text(clip.tokenize(text_batch_1).to(device))
feature_batch_N = model.encode_text(clip.tokenize(text_batch_N).to(device))
assert feature_no_batch[0].shape == feature_batch_N[0].shape
# Check if inference is the same when batch size is 1
is_close = torch.allclose(feature_no_batch[0], feature_batch_1[0])
print(f"Are the samples identical for batch size 1: {is_close}")
# Check if inference is different when batch size != 1
is_close = torch.allclose(feature_no_batch[0], feature_batch_N[0])
print(f"Are the samples identical for batch size {batch_size}: {is_close}\n")
# Find the indices that differ and print the corresponding values
not_close_idx = torch.nonzero(~torch.isclose(feature_no_batch[0], feature_batch_N[0])).squeeze(1)
print("Failing indices:")
print(feature_no_batch[0][not_close_idx])
print(feature_batch_N[0][not_close_idx])
print(feature_no_batch[0][not_close_idx] == feature_batch_N[0][not_close_idx])
output:
Are the samples identical for batch size 1: True
Are the samples identical for batch size 16: False
Failing indices:
tensor([-9.9881e-05, 3.2202e-03, 6.7967e-03, 1.5653e-04],
grad_fn=<IndexBackward>)
tensor([-9.9864e-05, 3.2201e-03, 6.7965e-03, 1.5655e-04],
grad_fn=<IndexBackward>)
tensor([False, False, False, False])
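For reference, here is a small follow-up sketch (not part of the original script, and assuming the variables feature_no_batch and feature_batch_N from the code above are still in scope) that quantifies how large the discrepancy is. The default torch.allclose tolerances (rtol=1e-5, atol=1e-8) are fairly strict for float32, so a slightly looser tolerance may already report the two features as equal.
# Follow-up sketch: measure the size of the batch-size-dependent difference.
# Assumes feature_no_batch and feature_batch_N from the script above.
with torch.no_grad():
    diff = (feature_no_batch[0] - feature_batch_N[0]).abs()
    rel = diff / feature_no_batch[0].abs().clamp_min(1e-12)
    print(f"max abs diff: {diff.max().item():.3e}")
    print(f"max rel diff: {rel.max().item():.3e}")
    # With a slightly looser absolute tolerance the features may already compare equal.
    print(torch.allclose(feature_no_batch[0], feature_batch_N[0], atol=1e-6))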