Open pspdada opened 3 days ago
Further experimentation has revealed that for batch inference with more than one data entry, an error occurs when the input texts have different lengths.
Hey! I can look into this some time next week, I'm a bit out of bandwidth. In the meanwhile, can you check the comment here in case it helps? :)
You could try to tokenize the text beforehand. In my case, I have the same list of classes for each image, but you can change that. I do it like this:
classes = "dog. cat."
batch_classes = [classes] * data_loader.batch_size
tokenized_text = processor.tokenizer(
batch_classes,
padding="max_length",
return_tensors="pt",
max_length=256, # Adjust max_length to match vision hidden state
).to(device)
Then, when you iterate over the batches, do not provide the text:
inputs = processor(text=None, images=images, return_tensors="pt").to(device)
And update the inputs:
inputs.update(tokenized_text)
before you run
results = processor.post_process_grounded_object_detection(
outputs,
inputs.input_ids,
box_threshold=detection_threshold,
text_threshold=detection_threshold,
)
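Putting it all together, a minimal end-to-end sketch (the model id, threshold value, and image paths here are placeholders I made up, not from your setup):

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

images = [Image.open("img0.jpg"), Image.open("img1.jpg")]  # placeholder paths
detection_threshold = 0.3  # placeholder value

# Tokenize the class prompt once, padded to a fixed length for every batch.
classes = "dog. cat."
batch_classes = [classes] * len(images)
tokenized_text = processor.tokenizer(
    batch_classes,
    padding="max_length",
    return_tensors="pt",
    max_length=256,  # adjust to match the model's text length limit
).to(device)

# Run the processor on images only, then merge in the pre-tokenized text.
inputs = processor(text=None, images=images, return_tensors="pt").to(device)
inputs.update(tokenized_text)

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=detection_threshold,
    text_threshold=detection_threshold,
    target_sizes=[img.size[::-1] for img in images],  # (height, width)
)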
Could you tell me what "Adjust max_length to match vision hidden state" in your comment means? I just assumed this value is there to ensure that all inputs are shorter than this number.
from transformers import AutoProcessor, GroundingDinoProcessor

processor: GroundingDinoProcessor = AutoProcessor.from_pretrained(
    "IDEA-Research/grounding-dino-tiny",
    cache_dir="/root/llm-project/util/model",
    padding_side="left",
)
I found that the cause of the error was setting padding_side="left" during the initialization of the processor. After removing it, everything worked fine.
I saw a usage tip for the llava 1.5 model that encourages setting padding_side="left" for more accurate results, so I set it the same way for the Grounding DINO model. Could you tell me which models require this setting? If it causes errors for some models, should we prevent users from doing so, or at least provide a warning?
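For context, my current understanding (which may be wrong) is that left padding matters for decoder-only models used for generation, while Grounding DINO's text backbone is a BERT-style encoder that expects the default right padding. A quick sketch of the difference (the model ids are just examples):

from transformers import AutoProcessor

# Grounding DINO's tokenizer pads on the right by default:
gd_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
print(gd_processor.tokenizer.padding_side)  # "right"

# LLaVA 1.5 wraps a decoder-only LM; left padding keeps each prompt
# adjacent to its generated tokens in a batch:
llava_processor = AutoProcessor.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", padding_side="left"
)
print(llava_processor.tokenizer.padding_side)  # "left"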
Could you tell me what "Adjust max_length to match vision hidden state" in your comment means?
That was basically a result of debugging. I first did not set it (because OwlV2 does not need it), but then I ran into a few shape mismatches.
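If it helps, I believe the 256 is not arbitrary: the Grounding DINO config exposes a max_text_len field (256 for grounding-dino-tiny, as far as I can tell), so you can read it from the config instead of hard-coding it. A sketch:

from transformers import AutoConfig

# The text branch is padded to the config's text length limit, so reading
# it from the config keeps the tokenizer and the model in sync.
config = AutoConfig.from_pretrained("IDEA-Research/grounding-dino-tiny")
print(config.max_text_len)  # 256 for this checkpoint

Passing that value as max_length to the tokenizer call above should avoid the shape mismatches.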
System Info
transformers version: 4.46.0.dev0
Who can help?
@clefourrier @zucchini-nlp @amyeroberts
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
When I use Grounding DINO for batch inference, I encounter an issue: performing batch inference on multiple images with different texts results in errors. However, if I make the text corresponding to each image identical, there is no problem.
The whole code to reproduce it:
The content of the source jsonl file is like this: the images are from COCO and they are all valid images.
The line
outputs = model(**encoded_inputs)
raises an error. In the line before it, the inputs to the processor (images and texts) are:
When I change the text input to
captions: list[str] = ["person."] * len(images)
everything works fine.
Expected behavior
I don't know why different text inputs within a single batch raise an error during batch inference.
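A minimal sketch of the pattern that fails for me (image paths and captions are placeholders standing in for the jsonl entries above):

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "IDEA-Research/grounding-dino-tiny"

# padding_side="left" is the setting identified above as the culprit.
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

images = [Image.open("coco_0.jpg"), Image.open("coco_1.jpg")]
captions: list[str] = ["person.", "a dog. a cat."]  # different token lengths

encoded_inputs = processor(
    images=images, text=captions, padding=True, return_tensors="pt"
).to(device)
outputs = model(**encoded_inputs)  # the error is raised here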