run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Multi modal rag #12039

Closed: Prem-Nitin closed this issue 2 weeks ago

Prem-Nitin commented 6 months ago

Question Validation

Question

I have a use case with hundreds of documents. I have implemented a RAG pipeline for answering questions about the text, but my requirement extends to images as well. The documents contain steps for some process, and each step has some text followed by an image. The application I am trying to build should behave in a way that, when asked any question about the process, it returns not only the steps but also the images corresponding to them (and it has to maintain the order of the images).

For ex:

Step 1: some text, followed by the respective image for step 1

Step 2: some text, followed by the respective image for step 2

and so on.

How do you even do this? Is it possible?

dosubot[bot] commented 6 months ago

Hey @Prem-Nitin! Great to see you back with another intriguing question. Let's dive into it. 🚀

Based on your requirements, it seems like you're looking for a way to implement a multi-modal Retrieval-Augmented Generation (RAG) system that can handle both text and images. This is indeed possible with the LlamaIndex repository.

The LlamaIndex repository provides a class called OpenAIMultiModal, which is designed to interact with OpenAI's API with multi-modal capabilities (i.e., handling both text and images). This class extends the MultiModalLLM class, indicating it's part of a system designed to work with multi-modal Large Language Models (LLMs). You can use this class to generate answers over the retrieved text chunks and images.

Here's a high-level overview of how you can use it:

  1. Parse the markdown document to extract text chunks and image URLs: convert the markdown to HTML with a parser such as Python-Markdown or Mistune, then use an HTML parser like BeautifulSoup to pull out the text and image URLs (see the first sketch after this list).

  2. Initialize the OpenAIMultiModal class with the appropriate parameters, define your text prompt and image documents, and call the complete method to get a response (see the second sketch after this list).

  3. Utilize the RAG dataset classes in the LlamaIndex repository, such as RagExamplePrediction, LabelledRagDataExample, RagPredictionDataset, and LabelledRagDataset, to store the user query, the contexts used for the response, and the reference answer, and to generate predictions over those examples with a query engine (see the third sketch after this list).
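
For step 1, here is a minimal sketch, assuming the steps live in a local markdown file; the file name and the `extract_steps` helper are illustrative, not part of LlamaIndex:

```python
# pip install markdown beautifulsoup4
import markdown
from bs4 import BeautifulSoup


def extract_steps(md_path: str):
    """Convert markdown to HTML, then walk it in document order so each
    text chunk stays paired with the image(s) that follow it."""
    with open(md_path, encoding="utf-8") as f:
        html = markdown.markdown(f.read())

    soup = BeautifulSoup(html, "html.parser")
    chunks = []  # ordered list of {"text": ..., "images": [...]} dicts
    for node in soup.find_all(["p", "img"]):
        if node.name == "img":
            if chunks:
                chunks[-1]["images"].append(node.get("src"))
        else:
            text = node.get_text(strip=True)
            if text:
                chunks.append({"text": text, "images": []})
    return chunks


# "process_guide.md" is a hypothetical input file with the step text and images.
for step in extract_steps("process_guide.md"):
    print(step["text"], step["images"])
```

Keeping each image attached to the text chunk that precedes it is what lets you preserve the step order later.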
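
For step 2, a sketch using OpenAIMultiModal; the folder path, model name, and prompt are assumptions for illustration:

```python
# pip install llama-index llama-index-multi-modal-llms-openai
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Load the step images from a local folder as ImageDocument objects.
image_documents = SimpleDirectoryReader("./step_images").load_data()  # assumed folder

# Any OpenAI multi-modal model name works here; gpt-4-vision-preview is one example.
openai_mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=512)

response = openai_mm_llm.complete(
    prompt="Summarize the process shown in these step images, keeping the order.",
    image_documents=image_documents,
)
print(response)
```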
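
For step 3, a small sketch of how those dataset classes fit together; the query, contexts, and answer are placeholders, and I'm assuming the llama_index.core.llama_dataset import path:

```python
from llama_index.core.llama_dataset import (
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
    LabelledRagDataset,
)

# One labelled example: the query, the contexts that should support the answer,
# and a reference answer, each annotated with who created it.
example = LabelledRagDataExample(
    query="What are the steps of the process?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_contexts=["Step 1: ...", "Step 2: ..."],
    reference_answer="Step 1 does ..., then step 2 does ...",
    reference_answer_by=CreatedBy(type=CreatedByType.HUMAN),
)
rag_dataset = LabelledRagDataset(examples=[example])

# Later, run the dataset through your query engine to get RagExamplePrediction
# objects (collected in a RagPredictionDataset), e.g.:
# predictions = rag_dataset.make_predictions_with(query_engine, show_progress=True)
```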

You can find more details about the OpenAIMultiModal class in the source code.

Additionally, you can refer to the multi-modal documentation provided by LlamaIndex, specifically the examples and use cases for multi-modal RAG with CLIP, to enable both image and text retrieval for a given query; a rough sketch of that approach follows.
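
The sketch below assumes a local Qdrant store and a ./data folder containing both the text chunks and the step images; the paths and collection names are illustrative:

```python
# pip install llama-index llama-index-vector-stores-qdrant \
#     llama-index-embeddings-clip qdrant-client
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Separate collections for text and image embeddings; MultiModalVectorStoreIndex
# uses a CLIP embedding model for the images by default.
client = qdrant_client.QdrantClient(path="qdrant_mm_db")  # local on-disk store
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# ./data holds the parsed text plus the step images.
documents = SimpleDirectoryReader("./data").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Retrieve the top text chunks and the top images for a question.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
for result in retriever.retrieve("What are the steps of the process?"):
    print(type(result.node).__name__, result.score)
```

You would then pass the retrieved text and image nodes, reordered by the step metadata you captured during parsing, to OpenAIMultiModal to synthesize the final answer.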

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

LJ-Hao commented 3 months ago

Check here, maybe it will help you. I am trying to use CLIP to embed images and text in the next version.