Coca Image-to-Text, RuntimeError: Boolean value of Tensor with more than one value is ambiguous

sky-cake commented 2 weeks ago

When running the exact code from https://github.com/mlfoundations/open_clip?tab=readme-ov-file#generating-text-with-coca, with a 1000x1300 pixel image, I get

Traceback (most recent call last):
  File "/home/dolphin/Desktop/clip/main.py", line 21, in <module>
    generated = model.generate(im)
  File "/home/dolphin/Desktop/clip/venv/lib/python3.10/site-packages/open_clip/coca_model.py", line 233, in generate
    output = self._generate_beamsearch(
  File "/home/dolphin/Desktop/clip/venv/lib/python3.10/site-packages/open_clip/coca_model.py", line 442, in _generate_beamsearch
    if beam_scorer.is_done or stopping_criteria(input_ids, None):
RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Doing the following makes it work

if beam_scorer.is_done: # or stopping_criteria(input_ids, None):
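For anyone landing here, a minimal sketch of why the original line fails, assuming the stopping criteria return a tensor with one flag per sequence rather than a single bool (the traceback suggests this, but the thread does not confirm it). The `flags` values and the `criteria_met` helper are hypothetical and are not part of open_clip.

```python
import torch

# Hypothetical per-sequence "stop" flags, one per beam/batch item.
flags = torch.tensor([True, False, True])

try:
    if flags:  # `if` and `or` both call bool() on the tensor
        pass
    message = ""
except RuntimeError as err:
    message = str(err)
print(message)


def criteria_met(result) -> bool:
    """Collapse a stopping-criteria result to one Python bool."""
    if isinstance(result, torch.Tensor):
        # Stop only once every sequence reports that it is done.
        return bool(result.all())
    return bool(result)


print(criteria_met(flags))
```

Note that commenting out the `stopping_criteria` call, as in the workaround above, also skips whatever that check was enforcing, so an explicit reduction like this is probably the safer pattern.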

rwightman commented 2 weeks ago

@sky-cake I just merged a PR from @MengqingCao into the main branch that should address this issue.

skwzrd commented 1 week ago

@rwightman I can say it works with a CPU now.

However, my script that throws everything on the GPU doesn't seem to work.

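import os

import open_clip
import torch
from PIL import Image

print(torch.cuda.is_available())

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model, _, transform = open_clip.create_model_and_transforms(
  model_name="coca_ViT-L-14",
  pretrained="mscoco_finetuned_laion2B-s13B-b90k",
  device=device,
)

# PIL does not expand "~", so expand it explicitly.
im = Image.open(os.path.expanduser('~/005263.jpg')).convert("RGB")
im = transform(im).unsqueeze(0).to(device)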

with torch.no_grad(), torch.cuda.amp.autocast():
  generated = model.generate(im)

print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))

True
Traceback (most recent call last):
  File "/home/dolphin/Documents/code/clip/main.py", line 43, in <module>
    generated = model.generate(im)
  File "/home/dolphin/Documents/code/clip/open_clip/src/open_clip/coca_model.py", line 240, in generate
    output = self._generate_beamsearch(
  File "/home/dolphin/Documents/code/clip/open_clip/src/open_clip/coca_model.py", line 417, in _generate_beamsearch
    next_token_scores_processed = logits_processor(
  File "/home/dolphin/Documents/code/clip/venv/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 98, in __call__
    scores = processor(input_ids, scores)
  File "/home/dolphin/Documents/code/clip/venv/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 157, in __call__
    eos_token_mask = torch.isin(vocab_tensor, self.eos_token_id)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument test_elements in method wrapper_CUDA_isin_Tensor_Tensor)
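The failing call is `torch.isin(vocab_tensor, self.eos_token_id)`, where one tensor is on `cuda:0` and the other on the CPU. A minimal sketch of the general remedy, which is to create or move both operands onto the same device before the call. The `eos_mask` helper and the token ids are hypothetical, and this is not the actual patch to open_clip or transformers.

```python
import torch


def eos_mask(vocab_size: int, eos_token_id, device) -> torch.Tensor:
    """Boolean mask over the vocabulary marking the EOS token ids."""
    vocab = torch.arange(vocab_size, device=device)
    # as_tensor also moves an existing CPU tensor onto `device`.
    eos = torch.as_tensor(eos_token_id, device=device)
    return torch.isin(vocab, eos)


device = "cuda" if torch.cuda.is_available() else "cpu"
mask = eos_mask(10, [2, 7], device)
print(mask.nonzero().flatten().tolist())
```

On a CPU-only machine both tensors already share a device, which would explain why the same script runs there.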

vincedovy commented 1 week ago

Just had the same issue. I copied the relevant lines from the new coca_model.py to my local copy and it seems to work now. However, isn't the first if statement redundant?

if is_done:
    break
if beam_scorer.is_done or is_done:
    break

rwightman commented 6 days ago

OK, I have a PR ready: the device mismatch on the token passed to the stopping criteria is fixed. Also, I don't think the any() logic added in the previous PR made sense. I'm pretty sure all() makes more sense in every case, and in the scenarios I tested, any() was terminating some batch items early.
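To illustrate the any() versus all() difference, here is a toy simulation, not the open_clip generation loop. The `steps_until_stop` helper and the finish steps are hypothetical: each sequence is assumed to finish at a fixed step, and the loop stops when the chosen reduction over the per-sequence flags becomes true.

```python
import torch


def steps_until_stop(finish_steps, reduce) -> int:
    """Count loop iterations until reduce(done_flags) is true."""
    finish = torch.tensor(finish_steps)
    step = 0
    while True:
        step += 1
        done = finish <= step  # one flag per sequence
        if bool(reduce(done)):
            return step


finish_steps = [2, 5, 3]
any_steps = steps_until_stop(finish_steps, torch.any)
all_steps = steps_until_stop(finish_steps, torch.all)
print(any_steps, all_steps)
```

With any(), the loop ends as soon as the first sequence finishes, cutting the others short. With all(), it runs until the slowest sequence is done.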

Coca Image-to-Text, RuntimeError: Boolean value of Tensor with more than one value is ambiguous #898