🤖
Hello @xjw1001002! Great to see you using LlamaIndex again. I hope your day's going well!
To develop a multimodal RAG system that retrieves relevant text chunks and related images from a well-formatted markdown document, and then queries them with GPT-4 and GPT-4V, you can follow the steps below:
**Markdown Parsing:** First, you need to parse the markdown document to extract the text chunks and the image URLs. You can use a markdown parser like Python-Markdown or Mistune; these libraries convert markdown to HTML, and you can then use an HTML parser like BeautifulSoup to extract the text and image URLs.
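For illustration, here is a minimal sketch of that parsing step, assuming the `markdown` and `beautifulsoup4` packages are installed; the file name `my_document.md` and the helper `parse_markdown` are made up for this example:

```python
from pathlib import Path

import markdown  # Python-Markdown
from bs4 import BeautifulSoup


def parse_markdown(path: str) -> tuple[str, list[str]]:
    """Convert a markdown file to HTML, then pull out plain text and image URLs."""
    html = markdown.markdown(Path(path).read_text(encoding="utf-8"))
    soup = BeautifulSoup(html, "html.parser")

    text = soup.get_text(separator="\n")
    image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]
    return text, image_urls


# Hypothetical usage
text, image_urls = parse_markdown("my_document.md")
```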
**Text and Image Retrieval:** Once you have the text chunks and image URLs, you can use the `OpenAIMultiModal` class provided in the LlamaIndex repository to query GPT-4V with the relevant text chunks and images. Here is an example of how you can use this class:
```python
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageDocument  # llama-index >= 0.10 import path

# Initialize the OpenAIMultiModal class
multi_modal = OpenAIMultiModal(
    model="gpt-4-vision-preview",   # the multi-modal OpenAI model to use
    temperature=0.5,                # the temperature to use for sampling
    max_new_tokens=300,             # the maximum number of tokens to generate
    image_detail="low",             # the level of detail for images in API calls
    api_key="your_openai_api_key",  # your OpenAI API key
    # api_base and api_version can also be overridden here if needed
)

# Define your text prompt and image documents
prompt = "Your text prompt here"
image_documents = [ImageDocument(image_url="url_to_your_image")]

# Use the complete method to get a response
response = multi_modal.complete(prompt, image_documents)

# Print the response text
print(response.text)
```
In this example, replace `"Your text prompt here"` with the text chunk you want to query, and `"url_to_your_image"` with the URL of the image you want to query. The `complete` method returns a `CompletionResponse` object, and you can access the generated text with `response.text`. Please note that you also need to replace `"your_openai_api_key"` with your actual OpenAI API key.
For building and evaluating RAG test sets, LlamaIndex also provides the classes `RagExamplePrediction`, `LabelledRagDataExample`, `RagPredictionDataset`, and `LabelledRagDataset`. You can use these classes to store the user query, the contexts used for the response, and the reference answer to the query, and to generate RAG predictions with a query engine (a rough sketch follows below).

Please note that this is a basic example and does not include error handling or other considerations you might need in a production environment. You can find more details in the LlamaIndex repository.
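As a rough illustration of how those dataset classes fit together, here is a sketch using `llama_index.core.llama_dataset`; the query, context, and answer strings are invented placeholders:

```python
from llama_index.core.llama_dataset import (
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
    LabelledRagDataset,
)

# One labelled example: a query, the contexts that support the answer,
# and a reference answer, each tagged with who/what created it
example = LabelledRagDataExample(
    query="What does the architecture diagram in section 2 show?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_contexts=["...the markdown chunk that describes the diagram..."],
    reference_answer="It shows the data flow between the parser and the index.",
    reference_answer_by=CreatedBy(type=CreatedByType.HUMAN),
)

# Collect examples into a dataset that can later be used to benchmark
# the predictions produced by a query engine
rag_dataset = LabelledRagDataset(examples=[example])
rag_dataset.save_json("rag_dataset.json")
```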
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
In addition to dosu-bot's suggestion, you should take a look at our multi-modal docs if you haven't done so already. Our multi-modal RAG builds facilitate both image and text retrieval for a given query.
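For reference, here is a condensed sketch of that pattern. It assumes llama-index >= 0.10 with the OpenAI multi-modal and CLIP image-embedding integrations installed, `OPENAI_API_KEY` set in the environment, and a hypothetical `./data` folder containing the markdown text plus its images; the official examples often plug in an external vector store (e.g. Qdrant), while the default in-memory stores are assumed here, and parameter names may differ slightly between releases:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Load both the markdown text and the image files it references
documents = SimpleDirectoryReader("./data").load_data()

# Build an index that embeds text chunks and images into separate vector stores
index = MultiModalVectorStoreIndex.from_documents(documents)

# Option 1: retrieve nodes directly, e.g. top-5 text chunks plus related images
retriever = index.as_retriever(similarity_top_k=5, image_similarity_top_k=3)
retrieved_nodes = retriever.retrieve("Your question about the document")

# Option 2: retrieve and synthesize an answer with GPT-4V in one call
openai_mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)
query_engine = index.as_query_engine(multi_modal_llm=openai_mm_llm)
response = query_engine.query("Your question about the document")
print(response)
```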
Question
I would like to know: if I have a well-formatted markdown document with pictures inserted using markdown syntax, and all of the pictures are provided, how can I build a multimodal RAG system?
For example, when I ask a question, the RAG system should retrieve 5 chunks and also retrieve the related pictures, and then query GPT-4 with the chunks and GPT-4V with the pictures.