run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How to design a RAG system for a markdown file with pictures? #10307

xjw1001002 closed this issue 10 months ago

xjw1001002 commented 10 months ago

Question

I have a well-formatted markdown document with pictures inserted using markdown syntax, and all of the picture files are available.

How can I build a multimodal RAG system on top of it?

For example, when I ask a question, the RAG system should retrieve 5 text chunks along with the related pictures, and then answer by passing the chunks to GPT-4 and the pictures to GPT-4V.

dosubot[bot] commented 10 months ago

🤖

Hello @xjw1001002! Great to see you using LlamaIndex again. I hope your day's going well!

To develop a multimodal RAG system that retrieves relevant text chunks and related images from a well-formatted markdown document, and then queries them with GPT-4 and GPT-4V, you can follow the steps below:

  1. Markdown Parsing: First, parse the markdown document to extract the text chunks and the image URLs. You can use a markdown parser such as Python-Markdown or Mistune to convert the markdown to HTML, and then use an HTML parser such as BeautifulSoup to extract the text and the image URLs (a short sketch follows below).

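For illustration, here is a minimal parsing sketch using Python-Markdown and BeautifulSoup. The file path and the way the text is split into chunks are placeholders you would adapt to your own document:

import markdown  # pip install markdown
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Read the markdown file (replace with your own path)
with open("your_document.md", "r", encoding="utf-8") as f:
    md_text = f.read()

# Convert the markdown to HTML, then parse the HTML
html = markdown.markdown(md_text)
soup = BeautifulSoup(html, "html.parser")

# Collect the image URLs referenced in the document
image_urls = [img.get("src") for img in soup.find_all("img")]

# Extract the text chunks (here, one chunk per paragraph, list item, or heading)
chunks = [el.get_text().strip() for el in soup.find_all(["p", "li", "h1", "h2", "h3"])]
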
  2. Querying with Text and Images: Once you have the text chunks and image URLs, you can use the OpenAIMultiModal class provided in LlamaIndex to pass the relevant text chunks and images to the multi-modal model. Here is an example of how you can use this class:

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.schema import ImageDocument

# Initialize the OpenAIMultiModal class
# (the default api_base and api_version are used; override them only if you need to)
multi_modal = OpenAIMultiModal(
    model="gpt-4-vision-preview",   # the multi-modal OpenAI model to use
    temperature=0.5,                # sampling temperature
    max_new_tokens=300,             # maximum number of tokens to generate
    image_detail="low",             # level of detail for images in API calls
    api_key="your_openai_api_key",  # your OpenAI API key
)

# Define your text prompt and image documents
prompt = "Your text prompt here"
image_documents = [ImageDocument(image_url="url_to_your_image")]

# Use the complete method to get a response
response = multi_modal.complete(prompt, image_documents)

# Print the response text
print(response.text)

In this example, replace "Your text prompt here" with the text chunk you want to query and "url_to_your_image" with the URL of the image you want to include. The complete method returns a CompletionResponse object, and you can access the generated text with response.text.

Please note that you need to replace "your_openai_api_key" with your actual OpenAI API key.
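
If you prefer not to hard-code the key, the OpenAI integrations in LlamaIndex can usually resolve it from the OPENAI_API_KEY environment variable instead; a small sketch, assuming the default credential resolution:

import os

# Set the key in the environment (or export it in your shell) instead of hard-coding it
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"

# With the environment variable set, the api_key argument can typically be omitted
multi_modal = OpenAIMultiModal(model="gpt-4-vision-preview")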

  3. RAG Datasets: The LlamaIndex repository also contains classes designed for working with RAG (Retrieval-Augmented Generation) datasets: RagExamplePrediction, LabelledRagDataExample, RagPredictionDataset, and LabelledRagDataset. You can use these classes to store the user query, the contexts used for the response, and the reference answer, and to generate predictions over those examples with a query engine (a rough sketch follows the note below).

Please note that this is a basic example and does not include error handling or other considerations you might need in a production environment. You can find more details in the LlamaIndex repository.
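
As a rough sketch of how those dataset classes fit together (import paths and field names can differ between LlamaIndex versions, so double-check them against the repository):

from llama_index.llama_dataset import (
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
    LabelledRagDataset,
)

# A single labelled example: the query, the contexts used, and the reference answer
example = LabelledRagDataExample(
    query="What does the diagram in section 2 show?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_contexts=["Text chunk that describes the diagram ..."],
    reference_answer="The diagram shows ...",
    reference_answer_by=CreatedBy(type=CreatedByType.HUMAN),
)

# Collect the examples into a dataset and generate predictions with a query engine
# (query_engine is assumed to be a LlamaIndex query engine you have already built)
rag_dataset = LabelledRagDataset(examples=[example])
predictions = rag_dataset.make_predictions_with(query_engine, show_progress=True)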

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

nerdai commented 10 months ago

In addition to dosu-bot's suggestion, you should take a look at our multi-modal docs if you haven't done so already.

Our multi-modal RAG builds support both image and text retrieval for a given query.
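
For reference, here is a rough sketch of what such a build can look like with a MultiModalVectorStoreIndex. The directory layout, model name, and parameters are placeholders, and the exact setup (including the extra dependencies needed for the default image embeddings) is described in the multi-modal docs for your LlamaIndex version:

from llama_index import SimpleDirectoryReader
from llama_index.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Load the markdown text and the image files from one folder
# (SimpleDirectoryReader loads images as ImageDocument objects alongside the text)
documents = SimpleDirectoryReader("./data").load_data()

# Build an index with separate text and image embeddings
index = MultiModalVectorStoreIndex.from_documents(documents)

# Retrieve text chunks and related images for a question
retriever = index.as_retriever(similarity_top_k=5, image_similarity_top_k=3)
nodes = retriever.retrieve("Your question here")

# Or run the full query, answering with a multi-modal LLM such as GPT-4V
query_engine = index.as_query_engine(
    multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)
)
response = query_engine.query("Your question here")
print(response)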