mlfoundations / open_clip

An open source implementation of CLIP.

ValueError: Batch dimension of `input_ids` should be 0, but is 6. #816

Closed. ShenZheng2000 closed this issue 5 months ago.

ShenZheng2000 commented 5 months ago

I am trying to run this code:

import open_clip
import torch
from PIL import Image

# Load the CoCa captioning model fine-tuned on MSCOCO
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k",
)

im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)  # add a batch dimension: (1, 3, H, W)

with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

# Strip the start/end special tokens from the decoded caption
print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))

But I get the following error:

Traceback (most recent call last):
  File "demo.py", line 16, in <module>
    generated = model.generate(im)
  File "/home/aghosh/anaconda3/envs/2pcnetnew/lib/python3.8/site-packages/open_clip/coca_model.py", line 233, in generate
    output = self._generate_beamsearch(
  File "/home/aghosh/anaconda3/envs/2pcnetnew/lib/python3.8/site-packages/open_clip/coca_model.py", line 351, in _generate_beamsearch
    raise ValueError(
ValueError: Batch dimension of `input_ids` should be 0, but is 6.

I have already tried the solutions suggested here, including installing transformers==4.30.2 and changing the computation of batch_size, but they do not solve the issue.
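
For what it's worth, the raise seems to come from a shape-consistency check in `_generate_beamsearch`. The snippet below is only my paraphrase of that check (an assumption, not the exact open_clip source); it shows why the message reads "should be 0, but is 6" when one image is expanded to the default 6 beams but the internally derived batch_size ends up as 0:

import torch

def check_beam_batch(input_ids: torch.Tensor, batch_size: int, num_beams: int) -> None:
    # input_ids is expected to hold batch_size * num_beams rows (one row per beam)
    batch_beam_size = input_ids.shape[0]
    if num_beams * batch_size != batch_beam_size:
        raise ValueError(
            f"Batch dimension of `input_ids` should be {num_beams * batch_size}, "
            f"but is {batch_beam_size}."
        )

# One image expanded to 6 beams gives 6 rows; if the derived batch_size comes
# out as 0 (which the error message implies), the expected dimension is 0 and
# the check fails with exactly the reported message.
try:
    check_beam_batch(torch.zeros(6, 1, dtype=torch.long), batch_size=0, num_beams=6)
except ValueError as e:
    print(e)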

rwightman commented 5 months ago

@ShenZheng2000 install a more recent version of transformers; it works fine with the latest.
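
For anyone hitting the same error: a minimal way to follow this suggestion, assuming any reasonably recent transformers release works (the thread does not pin an exact version), is to upgrade with pip install -U transformers and then confirm what is installed before re-running the script above:

# Quick sanity check after upgrading; assumes both packages expose __version__
import transformers
import open_clip

print("transformers:", transformers.__version__)
print("open_clip:", open_clip.__version__)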