Feature request
Implement a new pipeline that can take both an image and text as inputs and produce a text output. This would be particularly useful for multi-modal tasks such as visual question answering (VQA), image captioning, and image-grounded text generation.
from transformers import pipeline

# Initialize the pipeline with a multi-modal model
multi_modal_pipeline = pipeline("image-text-to-text", model="meta-llama/Llama-3.2-11B-Vision-Instruct")

# Example usage: the image is referenced inside the chat message
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/image.jpg"},  # placeholder: URL or local path to the input image
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]

result = multi_modal_pipeline(messages)
print(result)  # Should return generated text conditioned on the image and the prompt
Motivation
Simplifies workflows involving multi-modal data.
Enables more complex and realistic tasks to be handled with existing Transformer models.
Encourages wider use of multi-modal models in research and production.
Your contribution
Transformers Integration
Ensure that the pipeline works well within the Hugging Face Transformers library:
Implement the custom pipeline class (ImageTextToTextPipeline).
Add support for handling the different input types (image, text) and ensure smooth forward pass execution; a rough sketch of such a pipeline class follows.
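A minimal sketch of what the ImageTextToTextPipeline could look like, assuming the pipeline holds a multi-modal processor (self.processor) that handles both images and text; the structure follows the usual preprocess / _forward / postprocess split used by existing pipelines, and the input format and helper names here are illustrative rather than the final API:

from transformers import Pipeline
from transformers.image_utils import load_image


class ImageTextToTextPipeline(Pipeline):
    """Sketch of a pipeline that conditions text generation on an image."""

    def _sanitize_parameters(self, max_new_tokens=None, **kwargs):
        # Split user kwargs into preprocess / forward / postprocess parameters
        forward_kwargs = {}
        if max_new_tokens is not None:
            forward_kwargs["max_new_tokens"] = max_new_tokens
        return {}, forward_kwargs, {}

    def preprocess(self, inputs):
        # Assumed input format for this sketch: a dict with an "image"
        # (URL, path, or PIL.Image) and a "text" prompt
        image = load_image(inputs["image"])
        return self.processor(images=image, text=inputs["text"], return_tensors="pt")

    def _forward(self, model_inputs, **generate_kwargs):
        # Delegate to generate() so text-generation kwargs pass through
        return self.model.generate(**model_inputs, **generate_kwargs)

    def postprocess(self, model_outputs):
        # Decode the generated token ids back into text
        # (assumes the processor exposes batch_decode, as vision-language processors typically do)
        return self.processor.batch_decode(model_outputs, skip_special_tokens=True)

The final implementation would also need to handle chat-style message lists (as in the example above), batching, and generation parameters consistently with the other text-generation pipelines.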