huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[pipeline] VQA pipeline does not accept list as input #31216

Closed: BlacCod closed this issue 5 months ago

BlacCod commented 5 months ago

System Info

Who can help?

@Narsil @sijunhe

Information

Tasks

Reproduction

from transformers import pipeline
urls = ["https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png", "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/tree.png"]

oracle = pipeline(task="vqa", model="dandelin/vilt-b32-finetuned-vqa")
oracle(question="What's in the image?", image=urls, top_k=1)

(Truncated) error:

TypeError                                 Traceback (most recent call last)
Cell In[1], line 11
      8 oracle = pipeline(task="vqa", model="dandelin/vilt-b32-finetuned-vqa", image_processor=image_processor)
      9 # for out in tqdm(oracle(question="What's in this image", image=dataset, top_k=1)):
     10 #     print(out)
---> 11 oracle(question="What's in this image", image=urls, top_k=1)

File ~/dev/lab24-env/lib/python3.10/site-packages/transformers/pipelines/visual_question_answering.py:114, in VisualQuestionAnsweringPipeline.__call__(self, image, question, **kwargs)
    107     """
    108     Supports the following format
    109     - {"image": image, "question": question}
    110     - [{"image": image, "question": question}]
    111     - Generator and datasets
    112     """
    113     inputs = image
--> 114     results = super().__call__(inputs, **kwargs)
    115     return results

File ~/dev/lab24-env/lib/python3.10/site-packages/transformers/pipelines/base.py:1224, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1220 if can_use_iterator:
   1221     final_iterator = self.get_iterator(
   1222         inputs, num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1223     )
-> 1224     outputs = list(final_iterator)
...
    120         inputs["question"], return_tensors=self.framework, padding=padding, truncation=truncation
    121     )
    122     image_features = self.image_processor(images=image, return_tensors=self.framework)

TypeError: string indices must be integers

This error is reproducible on the latest version (v4.41.2).
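
The traceback makes the cause visible: when image is a list, __call__ sets inputs = image and forwards the list unchanged, so the question keyword is effectively lost and preprocess ends up indexing into a plain URL string. A condensed sketch of the failure path (an illustration based on the frames above, not the exact library source):

image = urls       # a list of URL strings
inputs = image     # visual_question_answering.py:113 keeps the list as-is
item = inputs[0]   # base.py iterates the list, handing preprocess one element
item["question"]   # preprocess indexes a plain str -> TypeError: string indices must be integers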

Expected behavior

The pipeline should broadcast the same question across all images and run the model on each image-question pair, as described in the documentation.

Note: the following currently works, but it is not as convenient as passing the lists directly (and it still does not allow passing a dataset directly):

oracle([{"question": "What's in the image?", "image": url} for url in urls])
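
Since the docstring also lists generators and datasets among the supported inputs, the same dict format can be produced lazily instead of building the full list up front. A small sketch (the pairs helper is hypothetical, not part of the library):

def pairs(question, images):
    # Lazily yield the {"question": ..., "image": ...} dicts the pipeline accepts,
    # so a large image collection never has to be materialized as a list.
    for image in images:
        yield {"question": question, "image": image}

for out in oracle(pairs("What's in the image?", urls), top_k=1):
    print(out)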
BlacCod commented 5 months ago

Currently, the __call__ method only handles a single image-question pair as input. I can make a quick PR so that it also handles lists of images and questions. I have no idea about the dataset part, though.
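
For reference, a rough sketch of the normalization such a PR could add at the top of VisualQuestionAnsweringPipeline.__call__. The broadcasting rules here are an assumption about how the fix might look, not settled API:

# Hypothetical normalization inside __call__ (broadcasting rules are an assumption):
if isinstance(image, (list, tuple)) and isinstance(question, str):
    # One question, many images: broadcast the question across the list.
    inputs = [{"image": img, "question": question} for img in image]
elif isinstance(image, (list, tuple)) and isinstance(question, (list, tuple)):
    # Paired lists must line up one-to-one.
    if len(image) != len(question):
        raise ValueError("image and question must have the same length")
    inputs = [{"image": img, "question": q} for img, q in zip(image, question)]
else:
    inputs = image
results = super().__call__(inputs, **kwargs)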