
Image-Text-to-Text Support in Transformers Pipeline #34169

Open chakravarthik27 opened 3 days ago

chakravarthik27 commented 3 days ago

Feature request

Add support for a pipeline that takes both an image and text as inputs and produces a text output. This would be particularly useful for multi-modal tasks such as visual question answering (VQA), image captioning, and image-based text generation.

from transformers import pipeline

# Initialize the pipeline with a multi-modal model
multi_modal_pipeline = pipeline("image-text-to-text", model="meta-llama/Llama-3.2-11B-Vision-Instruct")

# Example usage: a chat-style message containing an image and a text prompt
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},  # example image (two cats on a couch)
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
result = multi_modal_pipeline(messages)
print(result)  # Should return generated text conditioned on the image and the prompt

Motivation

Your contribution

Transformers integration: ensure that the new pipeline works well within the Hugging Face Transformers library:

class ImageTextToTextPipeline(Pipeline):
    ...
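
As a rough illustration (not a final design), such a class could follow the standard Pipeline interface, which requires _sanitize_parameters, preprocess, _forward, and postprocess; the self.processor attribute and the input format below are assumptions made for this sketch:

from transformers import Pipeline


class ImageTextToTextPipeline(Pipeline):
    # Hypothetical sketch: generate text conditioned on an image and a text prompt.

    def _sanitize_parameters(self, max_new_tokens=None, **kwargs):
        # Route generation options to _forward; no preprocess/postprocess options here.
        forward_kwargs = {}
        if max_new_tokens is not None:
            forward_kwargs["max_new_tokens"] = max_new_tokens
        return {}, forward_kwargs, {}

    def preprocess(self, inputs):
        # Assumed input format: {"image": <PIL.Image or URL>, "text": <prompt string>}.
        # self.processor (assumed to wrap the model's image processor and tokenizer)
        # turns the pair into model-ready tensors.
        return self.processor(images=inputs["image"], text=inputs["text"], return_tensors=self.framework)

    def _forward(self, model_inputs, **generate_kwargs):
        # Autoregressive generation from the combined image/text inputs.
        return self.model.generate(**model_inputs, **generate_kwargs)

    def postprocess(self, model_outputs):
        # Decode the generated token ids back into text.
        return self.processor.batch_decode(model_outputs, skip_special_tokens=True)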
yonigozlan commented 3 days ago

Good timing ;) https://github.com/huggingface/transformers/pull/34170

NOOB-del-ai commented 1 day ago

I'd like to work on this feature request.