huggingface / transformers


Error when running Grounding DINO for batch inference. #34346

Open · pspdada opened this issue 3 days ago

pspdada commented 3 days ago

Who can help?

@clefourrier @zucchini-nlp @amyeroberts

Reproduction

When I use Grounding DINO for batch inference, I run into a problem: performing batch inference on multiple images, each paired with a different text, raises an error. However, if I make the text identical for every image, everything works.

The full code to reproduce it:

import torch, json
from PIL import Image
from PIL.ImageFile import ImageFile
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection, GroundingDinoProcessor, GroundingDinoForObjectDetection

def open_image_from_url(images: Image.Image | str | list[Image.Image | str]) -> Image.Image | list[Image.Image]:
    def open_single_image(image: Image.Image | str) -> Image.Image:
        if isinstance(image, (Image.Image, ImageFile)):
            img = image
        else:
            img = Image.open(image)
        if img.mode != "RGB":
            img = img.convert("RGB")
        return img

    if isinstance(images, list):
        return [open_single_image(i) for i in images]
    else:
        return open_single_image(images)

model: GroundingDinoForObjectDetection = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-tiny",
    cache_dir='/root/llm-project/util/model',
    low_cpu_mem_usage=True,
).to('cuda').eval()

processor: GroundingDinoProcessor = AutoProcessor.from_pretrained(
    "IDEA-Research/grounding-dino-tiny",
    cache_dir='/root/llm-project/util/model',
    padding_side="left",
)

# Each line of the jsonl file is a dict with "image_path" and "obj_to_detect" keys.
datalist: list[dict[str, str]] = []
with open('temp.jsonl', 'r') as f:
    for line in f:
        datalist.append(json.loads(line))

images: list[Image.Image] = [open_image_from_url(data['image_path']) for data in datalist]
captions: list[str] = [data['obj_to_detect'] for data in datalist]

with torch.inference_mode():
    encoded_inputs = processor(
        images=images,
        text=captions,
        max_length=300,
        return_tensors="pt",
        padding=True,
        truncation=True,
    ).to('cuda')

    outputs = model(**encoded_inputs)
    target_sizes = [image.size[::-1] for image in images]
    results: list = processor.post_process_grounded_object_detection(
        outputs,
        encoded_inputs["input_ids"],
        box_threshold=0.3,
        text_threshold=0.3,
        target_sizes=target_sizes,
    )

The content of the source jsonl file is shown below. The images are from COCO and are all valid:

{"image_path": "000000449603.jpg", "obj_to_detect": "person. wave. surfboard."}

{"image_path": "000000565776.jpg", "obj_to_detect": "kitchen. appliance. utensil."}

{"image_path": "000000226903.jpg", "obj_to_detect": "dining. food."}

{"image_path": "000000480936.jpg", "obj_to_detect": "woman. chair. meal."}

{"image_path": "000000276018.jpg", "obj_to_detect": "area. child. adult."}

The line outputs = model(**encoded_inputs) raises the following error:

Traceback (most recent call last):
  File "/root/anaconda3/envs/LVLM/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/LVLM/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "/root/llm-project/LVLM/test.py", line 57, in <module>
    outputs = model(**encoded_inputs)
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/models/grounding_dino/modeling_grounding_dino.py", line 2582, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/models/grounding_dino/modeling_grounding_dino.py", line 2260, in forward
    text_self_attention_masks, position_ids = generate_masks_with_special_tokens_and_transfer_map(input_ids)
  File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/models/grounding_dino/modeling_grounding_dino.py", line 2050, in generate_masks_with_special_tokens_and_transfer_map
    position_ids[row, previous_col + 1 : col + 1] = torch.arange(
RuntimeError: upper bound and larger bound inconsistent with step sign
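For reference, the final RuntimeError is the generic message torch.arange raises whenever its end bound is smaller than its start bound while the step is positive. A minimal sketch, unrelated to Grounding DINO itself, that reproduces just this message:

import torch

# With the default step of 1 but end < start, arange raises:
# RuntimeError: upper bound and larger bound inconsistent with step sign
torch.arange(5, 2)

So, judging from the traceback, the arange call inside generate_masks_with_special_tokens_and_transfer_map apparently ends up with col smaller than previous_col for some row.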

For reference, the inputs passed to the processor (images and texts) were:

the images:
[<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x426 at 0x7F5462AF91B0>, 
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x421 at 0x7F5462B688E0>, 
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F5462B68970>, 
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=482x640 at 0x7F5462B68820>, 
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=416x640 at 0x7F5462B688B0>]

The texts:
['person. wave. surfboard.', 'kitchen. appliance. utensil.', 'dining. food.', 'woman. chair. meal.', 'area. child. adult.']

When I change the text input to captions: list[str] = ["person."] * len(images), everything works.

Expected behavior

I don't understand why having different texts within a single batch raises an error during batch inference.

pspdada commented 1 day ago

Further experimentation revealed that, for batches with more than one entry, the error occurs whenever the input texts have different lengths.

zucchini-nlp commented 1 day ago

Hey! I can look into it some time next week, I'm a bit out of bandwidth. In the meantime, can you check the comment here in case it helps? :)

daniel-bogdoll commented 23 hours ago

You could try to tokenize the text beforehand. In my case, I have the same list of classes for each image, but you can change that. I do it like this:

classes = "dog. cat."
batch_classes = [classes] * data_loader.batch_size
tokenized_text = processor.tokenizer(
                batch_classes,
                padding="max_length",
                return_tensors="pt",
                max_length=256,  # Adjust max_length to match vision hidden state
            ).to(device)

Then, when you iterate over the batches, do not provide the text:

inputs = processor(text=None, images=images, return_tensors="pt").to(device)

And update the inputs: inputs.update(tokenized_text)

before you run

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=detection_threshold,
    text_threshold=detection_threshold,
)
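Putting these steps together, a rough end-to-end sketch of the workaround (assuming model, processor, device, images, batch_classes, and detection_threshold are already defined as in the snippets above):

import torch

# Tokenize the texts once, padded to a fixed length.
tokenized_text = processor.tokenizer(
    batch_classes,
    padding="max_length",
    max_length=256,
    return_tensors="pt",
).to(device)

# For each batch: process only the images, then merge in the pre-tokenized text.
inputs = processor(text=None, images=images, return_tensors="pt").to(device)
inputs.update(tokenized_text)

with torch.inference_mode():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=detection_threshold,
    text_threshold=detection_threshold,
    target_sizes=[image.size[::-1] for image in images],
)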
pspdada commented 23 hours ago

Could you tell me what "Adjust max_length to match vision hidden state" in your comment means? I had just assumed this value only needs to be large enough that all inputs fit within it.

pspdada commented 22 hours ago

processor: GroundingDinoProcessor = AutoProcessor.from_pretrained(
    "IDEA-Research/grounding-dino-tiny",
    cache_dir='/root/llm-project/util/model',
    padding_side="left",
)

I found that the cause of the error was this line during the initialization of the processor: padding_side="left". After removing it, everything works fine.
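For what it's worth, here is a small sketch of how the two padding sides change the token layout for a batch of different-length texts (using the processor loaded earlier); the link to the failing special-token scan is my guess, I have not verified it in the source:

texts = ["dining. food.", "person. wave. surfboard."]

processor.tokenizer.padding_side = "right"  # the default
print(processor.tokenizer(texts, padding=True).input_ids)
# every row starts with [CLS] and ends with [SEP] followed by [PAD]s

processor.tokenizer.padding_side = "left"
print(processor.tokenizer(texts, padding=True).input_ids)
# shorter rows now start with [PAD]s before [CLS], which presumably confuses
# the special-token scan in generate_masks_with_special_tokens_and_transfer_map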

I saw a usage tip for LLaVA 1.5 that encourages setting padding_side="left" for more accurate results, so I set it the same way for Grounding DINO. Could you tell me which models require this setting? If setting it causes errors for some models, should we prevent users from doing so or at least emit a warning?
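In the meantime, a tiny user-side guard (purely hypothetical, not something the library provides) that would have surfaced this for me much earlier:

# Fail fast before encoding: the findings above suggest Grounding DINO's text
# mask generation expects right padding, so guard against a left-padded tokenizer.
assert processor.tokenizer.padding_side == "right", (
    "Grounding DINO batch inference appears to require right padding"
)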

daniel-bogdoll commented 22 hours ago

Regarding the "Adjust max_length to match vision hidden state" comment: that was basically a result of debugging. I first did not set max_length (because OwlV2 does not need it), but then I ran into a few shape mismatches.
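If hard-coding 256 feels brittle, one option is to read the length from the model config instead; this is just a sketch, and it assumes the loaded Grounding DINO config exposes a max_text_len field (falling back to 256 if it does not):

# Derive the tokenizer max_length from the model config instead of hard-coding it;
# getattr falls back to 256 in case the config does not expose max_text_len.
max_text_len = getattr(model.config, "max_text_len", 256)

tokenized_text = processor.tokenizer(
    batch_classes,
    padding="max_length",
    max_length=max_text_len,
    truncation=True,
    return_tensors="pt",
).to(device)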