run-llama/llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: How can I create a ImageDocument directly from a binary string or base64 encoded string? #10365

Closed: mw19930312 closed this issue 8 months ago

mw19930312 commented 8 months ago

Question

I'm trying to use the OpenAIMultiModal class to process a few images that I found in a PDF. The image data is held in a temporary variable, either as a binary string or as a base64-encoded string. However, from the official documentation, it seems that we need to read images from a URL or a local file to create an ImageDocument that OpenAIMultiModal can process.

I cannot save the images to disk and read them back due to disk space concerns. Is there any way to create an ImageDocument directly from a binary string? Or is it possible to call the complete function in OpenAIMultiModal with an image string? Currently I'm using the following code, which calls the vanilla OpenAI API directly, as a workaround. How can I convert it to use the llama_index wrapper?

import requests


def gpt4_image_summarization(encoded_image: str) -> tuple[str, int]:
    """Summarize an image (as a base64-encoded string) using the GPT-4-vision API.

    Returns the summary text and the HTTP status code.
    """
    # `settings` is the caller's own config module holding the OpenAI API key
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {settings.OPENAI_API_KEY}",
    }

    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """
            What is in this image? Please give me a summary of the image with no more than 3 sentences.
            If you identify the image as a diagram, please list out all components of the diagram.
            """,
                    },
                    {
                        # pass the image inline as a base64 data URL
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{encoded_image}"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 300,
    }

    response = requests.post(
        "https://api.openai.com/v1/chat/completions", headers=headers, json=payload
    )
    if "choices" in response.json():
        return response.json()["choices"][0]["message"]["content"], response.status_code
    else:
        return f"No summary is generated.\n {response.json()}", response.status_code

dosubot[bot] commented 8 months ago

🤖

Hey there @mw19930312! 🦙 Long time no chat. Hope you're doing fantastic and still exploring the endless possibilities with images. What's cooking in the tech kitchen today?

Based on the context provided, it seems that you can indeed create an ImageDocument directly from a base64 encoded string in LlamaIndex. This is possible through the ImageNode class, which has an attribute image that accepts a base64 encoded image string. The resolve_image method of the ImageNode class then decodes this base64 string into bytes that can be read by PIL (Python Imaging Library).

Here is the relevant code snippet from the LlamaIndex repository:

class ImageNode(TextNode):
    """Node with image."""

    # base64 encoded image str
    image: Optional[str] = None

    def resolve_image(self) -> ImageType:
        """Resolve an image such that PIL can read it."""
        if self.image is not None:
            import base64

            return BytesIO(base64.b64decode(self.image))

You can find this in the llama_index/schema.py file.
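
To illustrate, here is a minimal sketch of round-tripping a base64 string through ImageDocument (which inherits resolve_image from ImageNode in this version) and into PIL. It assumes Pillow is installed, and b64_str is a placeholder for an encoded image you already have in memory:

from PIL import Image

from llama_index.schema import ImageDocument

# b64_str is a hypothetical base64-encoded image string held in memory
doc = ImageDocument(image=b64_str)

# resolve_image() decodes the base64 payload into a BytesIO that PIL can open
img = Image.open(doc.resolve_image())
print(img.size)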

However, there is currently no direct support for creating an ImageDocument from a binary string. You would need to convert the binary string to a base64 encoded string before creating the ImageDocument.
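
If you start from raw bytes (for example, image data extracted from a PDF), that conversion is a one-liner. A minimal sketch, where raw_bytes is a placeholder for your in-memory binary image data:

import base64

from llama_index.schema import ImageDocument

# raw_bytes is a placeholder for the binary image data extracted from the PDF
b64_string = base64.b64encode(raw_bytes).decode("utf-8")
document = ImageDocument(image=b64_string)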

As for your question about the complete function in OpenAIMultiModal, I'm afraid I don't have enough information to provide a definitive answer. Could you please provide more details about the function and how you're planning to use it? This would help me give a more accurate response.


logan-markewich commented 8 months ago

Assuming you get the base64 as a string:

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.schema import ImageDocument

document = ImageDocument(text="optional", image=b64_string)

llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=300
)

response = llm.complete("Tell me about this image", image_documents=[document])
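
For completeness, here is one way the original gpt4_image_summarization helper might be rewritten on top of the wrapper. This is a sketch, not the only way to do it: the prompt text is copied from the question above, OPENAI_API_TOKEN stands in for however you store your key, and it assumes complete returns a response object whose text attribute holds the generated completion:

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.schema import ImageDocument


def gpt4_image_summarization(encoded_image: str) -> str:
    """Summarize a base64-encoded image via the OpenAIMultiModal wrapper."""
    llm = OpenAIMultiModal(
        model="gpt-4-vision-preview",
        api_key=OPENAI_API_TOKEN,  # placeholder: however you store your key
        max_new_tokens=300,
    )
    document = ImageDocument(image=encoded_image)
    prompt = (
        "What is in this image? Please give me a summary of the image "
        "with no more than 3 sentences. If you identify the image as a "
        "diagram, please list out all components of the diagram."
    )
    response = llm.complete(prompt, image_documents=[document])
    return response.text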