huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

SAM can't process batches with a nonhomogeneous count of bounding boxes per image #32488

Open royvelich opened 2 months ago

royvelich commented 2 months ago

System Info

Who can help?

@amyeroberts

Information

Tasks

Reproduction

Run the following code:

from transformers import SamProcessor, SamModel
from PIL import Image
import requests

# Load processor and model
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
model = SamModel.from_pretrained("facebook/sam-vit-base")

# Prepare batch of images and bounding boxes
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
images = [image, image]

bounding_boxes = [
    [[100, 100, 200, 200], [200, 200, 400, 400]],  # bounding boxes for image1
    [[100, 100, 200, 200]],  # bounding boxes for image2
]

# Process the batch
inputs = processor(
    images=images,
    input_boxes=bounding_boxes,
    return_tensors="pt"
)

You should get the following error: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Originating from: transformers\models\sam\processing_sam.py (line 142)
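For context, the failure comes from NumPy refusing to build a single rectangular array out of ragged per-image box lists. A minimal sketch reproducing the same ValueError outside of transformers (plain NumPy, no model needed):

```python
import numpy as np

# Ragged box lists: 2 boxes for the first image, 1 for the second,
# mirroring the bounding_boxes from the reproduction above.
ragged_boxes = [
    [[100, 100, 200, 200], [200, 200, 400, 400]],
    [[100, 100, 200, 200]],
]

# np.array cannot build a rectangular (batch, num_boxes, 4) array from
# lists with differing box counts, so it raises the same ValueError the
# processor surfaces from processing_sam.py.
try:
    np.array(ragged_boxes, dtype=np.float32)
except ValueError as err:
    print("raised:", type(err).__name__)
```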

Expected behavior

As an end-user, I expect to get 2 masks/results for the first image and 1 mask/result for the second image.

royvelich commented 2 months ago

Can I try to fix this? Or is it too complicated? @amyeroberts

amyeroberts commented 2 months ago

@royvelich Of course! Please feel free to open a PR to fix, ping me when ready for review, and feel free to ask any q's in the meantime.

royvelich commented 2 months ago

@amyeroberts Sure, I'll work on it. Can I ask questions in this thread if needed?

RaphaelMeudec commented 1 month ago

@amyeroberts what should be the output format for this? The variable input_boxes returned by the processor is currently a tensor, hence difficult to pack elements with different shapes inside it. I see two options: returning a list of tensors instead of a single tensor or returning a padded version of the tensor and a corresponding mask. What do you think about it?
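For illustration, the first option (a list of tensors) would keep one array per image so each image can carry its own box count. A hypothetical sketch with NumPy arrays standing in for tensors, not the current SamProcessor output format:

```python
import numpy as np

# One array per image instead of a single stacked tensor; each array is
# (num_boxes_i, 4), so the per-image box counts are free to differ.
input_boxes = [
    np.array([[100, 100, 200, 200], [200, 200, 400, 400]], dtype=np.float32),
    np.array([[100, 100, 200, 200]], dtype=np.float32),
]

print([b.shape for b in input_boxes])  # [(2, 4), (1, 4)]
```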

amyeroberts commented 3 weeks ago

@RaphaelMeudec In most of our other models, we process bounding boxes as "labels" which are a list of length batch_size and each element of the list is a BatchFeature. The other alternative is creating a tensor of (batch_size, max_num_boxes, 4) and then correctly masking / filtering the empty annotations when passed to the library. SAM is quite unusual in its API, so I think we can choose either. In both cases, we'll have to account for backwards compatibility and making sure the model can correctly handle the newly formatted input.
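A rough sketch of the second option, padding to (batch_size, max_num_boxes, 4) plus a validity mask marking real boxes. The helper name pad_boxes and the pad value are assumptions for illustration, not part of the transformers API:

```python
import numpy as np

def pad_boxes(batch_boxes, pad_value=-1.0):
    """Pad per-image box lists to (batch_size, max_num_boxes, 4) and
    return a boolean mask that is True where a real box exists.
    Hypothetical helper, not part of the transformers API."""
    max_boxes = max(len(boxes) for boxes in batch_boxes)
    padded = np.full((len(batch_boxes), max_boxes, 4), pad_value, dtype=np.float32)
    mask = np.zeros((len(batch_boxes), max_boxes), dtype=bool)
    for i, boxes in enumerate(batch_boxes):
        padded[i, : len(boxes)] = boxes
        mask[i, : len(boxes)] = True
    return padded, mask

bounding_boxes = [
    [[100, 100, 200, 200], [200, 200, 400, 400]],
    [[100, 100, 200, 200]],
]
padded, mask = pad_boxes(bounding_boxes)
print(padded.shape)      # (2, 2, 4)
print(mask.tolist())     # [[True, True], [True, False]]
```

The mask would then let the model (or a downstream filter) drop outputs produced by the padding rows, which is the "correctly masking / filtering the empty annotations" step mentioned above.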

cc @yonigozlan