Feature request
Implement a new pipeline that can take both an image and text as inputs and produce a text output. This would be particularly useful for multi-modal tasks such as visual question answering (VQA), image captioning, and image-grounded text generation.
from transformers import pipeline

# Initialize the pipeline with a multi-modal model
multi_modal_pipeline = pipeline("image-text-to-text", model="meta-llama/Llama-3.2-11B-Vision-Instruct")

# Example usage: the image is referenced inside the chat message
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/image.jpg"},  # placeholder: URL or local path to the input image
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]

result = multi_modal_pipeline(messages)
print(result)  # Should return generated text conditioned on the image and the prompt
Motivation
Simplifies workflows involving multi-modal data.
Enables more complex and realistic tasks to be handled with existing Transformer models.
Encourages wider use of multi-modal models in research and production.
Your contribution
Transformers Integration
Ensure that the pipeline works well within the Hugging Face Transformers library:
Implement the custom pipeline class (ImageTextToTextPipeline).
Add support for handling the different input types (image, text) and ensure smooth forward pass execution; a rough sketch of such a pipeline class follows.
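A minimal sketch of what the ImageTextToTextPipeline could look like, assuming the pipeline holds a multi-modal processor (self.processor) that handles both images and text; the structure follows the usual preprocess / _forward / postprocess split used by existing pipelines, and the input format and helper names here are illustrative rather than the final API:

from transformers import Pipeline
from transformers.image_utils import load_image


class ImageTextToTextPipeline(Pipeline):
    """Sketch of a pipeline that conditions text generation on an image."""

    def _sanitize_parameters(self, max_new_tokens=None, **kwargs):
        # Split user kwargs into preprocess / forward / postprocess parameters
        forward_kwargs = {}
        if max_new_tokens is not None:
            forward_kwargs["max_new_tokens"] = max_new_tokens
        return {}, forward_kwargs, {}

    def preprocess(self, inputs):
        # Assumed input format for this sketch: a dict with an "image"
        # (URL, path, or PIL.Image) and a "text" prompt
        image = load_image(inputs["image"])
        return self.processor(images=image, text=inputs["text"], return_tensors="pt")

    def _forward(self, model_inputs, **generate_kwargs):
        # Delegate to generate() so text-generation kwargs pass through
        return self.model.generate(**model_inputs, **generate_kwargs)

    def postprocess(self, model_outputs):
        # Decode the generated token ids back into text
        # (assumes the processor exposes batch_decode, as vision-language processors typically do)
        return self.processor.batch_decode(model_outputs, skip_special_tokens=True)

The final implementation would also need to handle chat-style message lists (as in the example above), batching, and generation parameters consistently with the other text-generation pipelines.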