Closed Prem-Nitin closed 2 weeks ago
Hey @Prem-Nitin! Great to see you back with another intriguing question. Let's dive into it. 🚀
Based on your requirements, it seems like you're looking for a way to implement a multi-modal Retrieval-Augmented Generation (RAG) system that can handle both text and images. This is indeed possible with the LlamaIndex repository.
The LlamaIndex repository provides a class called `OpenAIMultiModal`, which is designed to interact with OpenAI's API and specifically targets multi-modal capabilities (i.e., handling both text and images). This class extends the `MultiModalLLM` class, indicating it is part of a system designed to work with multi-modal Large Language Models (LLMs). You can use it to generate answers over the relevant text chunks and images.
Here's a high-level overview of how you can use it:
Parse the markdown document with a markdown parser like Python-Markdown or Mistune, then use an HTML parser like BeautifulSoup on the rendered output to extract the text chunks and image URLs.
Initialize the `OpenAIMultiModal` class with the appropriate parameters, define your text prompt and image documents, and use the `complete` method to get a response.
Utilize the RAG system classes in the LlamaIndex repository, such as `RagExamplePrediction`, `LabelledRagDataExample`, `RagPredictionDataset`, and `LabelledRagDataset`, to store the user query, the contexts used for the response, and the reference answer to the query, and to predict RAG examples with a query engine.
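As a rough illustration of the parsing step, here is a minimal standard-library-only sketch (it stands in for Python-Markdown/Mistune plus BeautifulSoup); the `Step N:` layout, the regular expressions, and the `parse_steps` helper are assumptions about how your documents are structured, not LlamaIndex API:

```python
import re

def parse_steps(markdown_text):
    """Split a markdown document into ordered steps, each keeping its text
    and any image URLs (``![alt](url)`` syntax) in the order they appear."""
    steps = []
    # Assumed layout: each step begins on a line like "Step 1: ..."
    for block in re.split(r"\n(?=Step \d+:)", markdown_text.strip()):
        # Collect image URLs from markdown image syntax, in document order.
        images = re.findall(r"!\[[^\]]*\]\(([^)]+)\)", block)
        # Remove the image markup, leaving just the step's text.
        text = re.sub(r"!\[[^\]]*\]\([^)]+\)", "", block).strip()
        steps.append({"text": text, "images": images})
    return steps

doc = """Step 1: Open the settings panel.
![panel](images/step1.png)
Step 2: Click the export button.
![export](images/step2.png)"""

for step in parse_steps(doc):
    print(step)
```

Because each step carries its own `images` list, the step-to-image ordering is preserved when you later index or display the results.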
You can find more details about the `OpenAIMultiModal` class in the source code.
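For orientation only, here is a hedged sketch of the prompt-plus-images flow described above. The import paths and the `gpt-4-vision-preview` model name vary across llama-index versions and should be checked against the version you have installed; the `answer_step_question` helper and its `step` dict are illustrative assumptions, not LlamaIndex API:

```python
# Sketch only: assumes `pip install llama-index` and an OPENAI_API_KEY.
# Import paths differ between llama-index versions; these are older ones.
try:
    from llama_index.multi_modal_llms.openai import OpenAIMultiModal
    from llama_index.schema import ImageDocument
except ImportError:
    OpenAIMultiModal = ImageDocument = None

def answer_step_question(question, step):
    """Ask the multi-modal LLM about one parsed step (hypothetical helper).

    `step` is assumed to be a dict like {"text": ..., "images": [paths]}.
    Returns None when llama-index is not installed.
    """
    if OpenAIMultiModal is None:
        return None
    llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)
    # Wrap each image path so the LLM receives the step's images together
    # with its text, keeping images attached to the right step.
    image_docs = [ImageDocument(image_path=path) for path in step["images"]]
    response = llm.complete(
        prompt=f"{question}\n\nRelevant step text:\n{step['text']}",
        image_documents=image_docs,
    )
    return response.text
```

In a full RAG pipeline you would first retrieve the matching step (its text plus its image paths) from your index, then pass it through a helper like this, so the answer can return the images in their original order alongside the step text.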
Additionally, you can refer to the multi-modal documentation provided by LlamaIndex, specifically the examples and use cases for multi-modal RAG with CLIP, to facilitate both image and text retrieval for a given query. Here are the relevant links:
I hope this helps! If you have any further questions or need more clarification, feel free to ask.
Question Validation
Question
I have a use case where I have hundreds of documents. I have implemented a RAG system for answering questions related to text, but my requirement extends to images as well. The documents contain steps for some process; each step has some text followed by an image. The application I am trying to implement should behave in such a way that, if asked any question about the process, it returns not only the steps but also the corresponding images (the order of the images has to be maintained).
For example:
Step 1: some text, then the respective image for step 1
Step 2: some text, then the respective image for step 2
and so on.
How do you even do this? Is it possible?