huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

SAM can't process batches with a nonhomogeneous count of bounding boxes per image #32488

Open royvelich opened 2 months ago

royvelich commented 2 months ago

System Info

Who can help?

@amyeroberts

Information

Tasks

Reproduction

Run the following code:

from transformers import SamProcessor, SamModel
from PIL import Image
import requests

# Load processor and model
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

# Prepare batch of images and bounding boxes
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
images = [image, image]

bounding_boxes = [
    [[100, 100, 200, 200], [200, 200, 400, 400]],  # bounding boxes for image1
    [[100, 100, 200, 200]],  # bounding boxes for image2
]

# Process the batch
inputs = processor(
    images=images,
    input_boxes=bounding_boxes,
    return_tensors="pt"
)

You should get the following error: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Originating from: transformers\models\sam\processing_sam.py (line 142)
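For context, the failure comes from NumPy refusing to build a single rectangular array out of ragged per-image box lists. A minimal sketch reproducing the same ValueError outside of transformers (plain NumPy, no model needed):

```python
import numpy as np

# Ragged box lists: 2 boxes for the first image, 1 for the second,
# mirroring the bounding_boxes from the reproduction above.
ragged_boxes = [
    [[100, 100, 200, 200], [200, 200, 400, 400]],
    [[100, 100, 200, 200]],
]

# np.array cannot build a rectangular (batch, num_boxes, 4) array from
# lists with differing box counts, so it raises the same ValueError the
# processor surfaces from processing_sam.py.
try:
    np.array(ragged_boxes, dtype=np.float32)
except ValueError as err:
    print("raised:", type(err).__name__)
```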

Expected behavior

As an end-user, I expect to get 2 masks/results for the first image and 1 mask/result for the second image.

royvelich commented 2 months ago

Can I try to fix this? Or is it too complicated? @amyeroberts

amyeroberts commented 2 months ago

@royvelich Of course! Please feel free to open a PR to fix, ping me when ready for review, and feel free to ask any q's in the meantime.

royvelich commented 2 months ago

@amyeroberts Sure, I'll work on it. Can I ask questions in this thread if needed?

RaphaelMeudec commented 1 month ago

@amyeroberts what should be the output format for this? The variable input_boxes returned by the processor is currently a tensor, hence difficult to pack elements with different shapes inside it. I see two options: returning a list of tensors instead of a single tensor or returning a padded version of the tensor and a corresponding mask. What do you think about it?
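For illustration, the first option (a list of tensors) would keep one array per image so each image can carry its own box count. A hypothetical sketch with NumPy arrays standing in for tensors, not the current SamProcessor output format:

```python
import numpy as np

# One array per image instead of a single stacked tensor; each array is
# (num_boxes_i, 4), so the per-image box counts are free to differ.
input_boxes = [
    np.array([[100, 100, 200, 200], [200, 200, 400, 400]], dtype=np.float32),
    np.array([[100, 100, 200, 200]], dtype=np.float32),
]

print([b.shape for b in input_boxes])  # [(2, 4), (1, 4)]
```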

amyeroberts commented 3 weeks ago

@RaphaelMeudec In most of our other models, we process bounding boxes as "labels" which are a list of length batch_size and each element of the list is a BatchFeature. The other alternative is creating a tensor of (batch_size, max_num_boxes, 4) and then correctly masking / filtering the empty annotations when passed to the library. SAM is quite unusual in its API, so I think we can choose either. In both cases, we'll have to account for backwards compatibility and making sure the model can correctly handle the newly formatted input.
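A rough sketch of the second option, padding to (batch_size, max_num_boxes, 4) plus a validity mask marking real boxes. The helper name pad_boxes and the pad value are assumptions for illustration, not part of the transformers API:

```python
import numpy as np

def pad_boxes(batch_boxes, pad_value=-1.0):
    """Pad per-image box lists to (batch_size, max_num_boxes, 4) and
    return a boolean mask that is True where a real box exists.
    Hypothetical helper, not part of the transformers API."""
    max_boxes = max(len(boxes) for boxes in batch_boxes)
    padded = np.full((len(batch_boxes), max_boxes, 4), pad_value, dtype=np.float32)
    mask = np.zeros((len(batch_boxes), max_boxes), dtype=bool)
    for i, boxes in enumerate(batch_boxes):
        padded[i, : len(boxes)] = boxes
        mask[i, : len(boxes)] = True
    return padded, mask

bounding_boxes = [
    [[100, 100, 200, 200], [200, 200, 400, 400]],
    [[100, 100, 200, 200]],
]
padded, mask = pad_boxes(bounding_boxes)
print(padded.shape)      # (2, 2, 4)
print(mask.tolist())     # [[True, True], [True, False]]
```

The mask would then let the model (or a downstream filter) drop outputs produced by the padding rows, which is the "correctly masking / filtering the empty annotations" step mentioned above.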

cc @yonigozlan