langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Best Approach to Build a Multimodal RAG Application? #8001

Closed taowang1993 closed 3 weeks ago

taowang1993 commented 3 weeks ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I found an article from NVIDIA about how to build a multimodal RAG pipeline.

I wonder how Dify plans to improve RAG pipelines in the future.

Currently, I am using Jina CLIP v1 to embed PDFs (text + images) into Milvus.

In this article, NVIDIA proposes three approaches to building multimodal RAG apps.

  1. embed both text and images into a single vector database
  2. convert images into text and embed only the text into a vector database
  3. embed images into one vector database and text into another

https://developer.nvidia.com/blog/an-easy-introduction-to-multimodal-retrieval-augmented-generation/
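As a runnable illustration of approach 1, here is a minimal in-memory sketch: a single index holds vectors for both text and images, assuming a shared multimodal encoder (such as Jina CLIP v1) has already mapped both modalities into one embedding space. The vectors below are hand-picked stand-ins, not real model output, and the `SingleIndex` class is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SingleIndex:
    """Approach 1: one index holds vectors for every modality."""
    def __init__(self):
        self.items = []  # (item_id, vector, modality)

    def add(self, item_id, vector, modality):
        self.items.append((item_id, vector, modality))

    def search(self, query_vec, k=3):
        # Rank all items, regardless of modality, by similarity to the query.
        ranked = sorted(self.items, key=lambda it: -cosine(query_vec, it[1]))
        return [(item_id, modality) for item_id, _, modality in ranked[:k]]

index = SingleIndex()
# Hand-picked vectors standing in for output of a shared multimodal encoder.
index.add("doc-1", [1.0, 0.0, 0.0], "text")   # paragraph about revenue
index.add("img-1", [0.9, 0.1, 0.0], "image")  # chart of quarterly revenue
index.add("doc-2", [0.0, 0.0, 1.0], "text")   # onboarding guide

hits = index.search([1.0, 0.05, 0.0], k=2)  # query vector for "revenue chart"
```

Because both modalities live in the same space, one query retrieves a mixed result list (here the revenue paragraph and the revenue chart), which is the main appeal of approach 1 over keeping separate stores.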

2. Additional context or comments

No response

3. Can you help us with this feature?

dosubot[bot] commented 3 weeks ago

I found a similar open issue that might be relevant to your question: Multimodal Embedding. This issue was last updated on September 01, 2024, and currently has no comments or solutions [1].

Regarding the best approach to building a multimodal Retrieval-Augmented Generation (RAG) application, the recommended method is to use the user's question to match the most relevant external data, retrieve that content, and insert it back into the model prompt as context. A hybrid search approach, combining vector search and keyword search, can be particularly effective: vector search excels at semantic relationships and at multilingual and multimodal understanding, while keyword search is strong at exact matching, abbreviations, and low-frequency words. Combining the two ensures the most relevant results are retrieved [2][3][4][5].
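The hybrid idea above can be sketched with Reciprocal Rank Fusion (RRF), one common way to merge a vector-ranked list with a keyword-ranked list. This is an illustrative sketch, not Dify's actual fusion logic; the input rankings are made-up document IDs.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: a document's score is the sum of
    1 / (k + rank) over every ranked list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers.
vector_hits  = ["d3", "d1", "d2"]   # semantic similarity order
keyword_hits = ["d1", "d4", "d3"]   # BM25/keyword order
fused = rrf([vector_hits, keyword_hits])
```

Documents ranked well by both retrievers (like `d1` here) float to the top, which is exactly the behavior the hybrid approach is after; the constant `k` damps the influence of any single list.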

Dify plans to improve RAG pipelines by visualizing each step in the pipeline and providing a simple, easy-to-use interface that helps application builders manage personal or team knowledge bases and quickly integrate them into AI applications. It is also gradually adding support for synchronizing data from various sources, including web pages, Notion, GitHub, and databases. Dify already offers extensive RAG capabilities covering everything from document ingestion to retrieval, with out-of-the-box support for extracting text from common formats such as PDF and PPT. Its hybrid search capability further enhances retrieval by combining the strengths of vector and keyword search [2][3][4][5].

Currently, Dify handles multimodal data in its RAG pipeline by supporting optional ETL (Extract, Transform, Load) solutions to preprocess and clean multi-source data. Specifically, Dify supports two ETL solutions: Dify ETL and Unstructured ETL. The choice of ETL solution depends on the version of Dify being used. Additionally, Dify employs a re-ranking model to further refine the results of the retrieval process, improving the quality and relevance of the retrieved information [6][7].
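The re-ranking step mentioned above can be sketched as follows: retrieval returns a coarse candidate list, then a re-rank model rescores each candidate against the query and reorders them. Here `score_pair` is a toy word-overlap stand-in for a real re-ranking model, not Dify's implementation.

```python
def score_pair(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words found in the passage.
    A real re-ranker would be a learned model scoring the (query, passage) pair."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words)

def rerank(query, candidates, top_n=2):
    # Rescore every retrieved candidate and keep the best top_n.
    return sorted(candidates, key=lambda p: score_pair(query, p), reverse=True)[:top_n]

candidates = [
    "onboarding checklist for new employees",
    "quarterly revenue grew in the chart",
    "revenue chart for the last quarter",
]
best = rerank("revenue chart", candidates)
```

The point of the two-stage design is cost: the cheap first-stage retriever narrows millions of chunks to a handful, and the more expensive re-ranker only scores that handful.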

Dify supports embedding images and text into a single vector database through the PGVector class, which handles the creation and management of vector embeddings for documents. The create and add_texts methods of the PGVector class embed documents (including text content) and store them in the vector database [8][9][6][10].
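To make the shape of that interface concrete, here is a simplified in-memory stand-in. The `create` and `add_texts` method names mirror the interface described above, but the bodies are plain Python, not Dify's PGVector implementation, and the embedding function is a hypothetical stub.

```python
class InMemoryVectorStore:
    """Simplified stand-in for a vector store wrapper like PGVector.
    Storage here is a Python list, not Postgres + pgvector."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: text -> vector
        self.rows = []            # (text, vector) pairs

    @classmethod
    def create(cls, texts, embed_fn):
        """Build a store and embed an initial batch of documents."""
        store = cls(embed_fn)
        store.add_texts(texts)
        return store

    def add_texts(self, texts):
        """Embed each text and persist it alongside its vector."""
        for t in texts:
            self.rows.append((t, self.embed_fn(t)))

# Stub embedding: vector is just the text length. Image content would enter
# this path as text (e.g. a caption) produced upstream in the pipeline.
store = InMemoryVectorStore.create(["alpha", "beta"], embed_fn=lambda t: [float(len(t))])
store.add_texts(["gamma caption of an image"])
```

Under this pattern, whether the pipeline follows approach 1 or 2 is decided before this layer: the store only ever sees text plus a vector, so image handling lives in the encoder or the captioning step upstream.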

I hope this helps! If you have any more questions or need further assistance, feel free to ask.

To continue talking to Dosu, mention @dosu.