langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Best Approach to Build a Multimodal RAG Application? #8001

Closed taowang1993 closed 3 weeks ago

taowang1993 commented 3 weeks ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I found an article from NVIDIA about how to build a multimodal RAG pipeline.

I wonder how Dify plans to improve RAG pipelines in the future.

Currently, I am using Jina CLIP v1 to embed PDFs (text + images) into Milvus.

In this article, NVIDIA proposes three approaches to building multimodal RAG apps.

  1. embed both text and images into a single vector database
  2. convert images into text and embed only the text into a vector database
  3. embed images into one vector database and text into another

https://developer.nvidia.com/blog/an-easy-introduction-to-multimodal-retrieval-augmented-generation/
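As a runnable illustration of approach 1, here is a minimal in-memory sketch: a single index holds vectors for both text and images, assuming a shared multimodal encoder (such as Jina CLIP v1) has already mapped both modalities into one embedding space. The vectors below are hand-picked stand-ins, not real model output, and the `SingleIndex` class is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SingleIndex:
    """Approach 1: one index holds vectors for every modality."""
    def __init__(self):
        self.items = []  # (item_id, vector, modality)

    def add(self, item_id, vector, modality):
        self.items.append((item_id, vector, modality))

    def search(self, query_vec, k=3):
        # Rank all items, regardless of modality, by similarity to the query.
        ranked = sorted(self.items, key=lambda it: -cosine(query_vec, it[1]))
        return [(item_id, modality) for item_id, _, modality in ranked[:k]]

index = SingleIndex()
# Hand-picked vectors standing in for output of a shared multimodal encoder.
index.add("doc-1", [1.0, 0.0, 0.0], "text")   # paragraph about revenue
index.add("img-1", [0.9, 0.1, 0.0], "image")  # chart of quarterly revenue
index.add("doc-2", [0.0, 0.0, 1.0], "text")   # onboarding guide

hits = index.search([1.0, 0.05, 0.0], k=2)  # query vector for "revenue chart"
```

Because both modalities live in the same space, one query retrieves a mixed result list (here the revenue paragraph and the revenue chart), which is the main appeal of approach 1 over keeping separate stores.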

2. Additional context or comments

No response

3. Can you help us with this feature?

dosubot[bot] commented 3 weeks ago

I found a similar open issue that might be relevant to your question: Multimodal Embedding. This issue was last updated on September 01, 2024, and currently has no comments or solutions [1].

Regarding the best approach to building a multimodal Retrieval-Augmented Generation (RAG) application, the recommended method is to use the user's question to match the most relevant external data, retrieve that content, and insert it back into the model prompt as context. A hybrid search approach, combining vector search and keyword search, can be particularly effective: vector search excels at semantic relationships and at multilingual and multimodal understanding, while keyword search is strong at exact matching, abbreviations, and low-frequency words. Combining the two ensures the most relevant results are retrieved [2][3][4][5].
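The hybrid idea above can be sketched with Reciprocal Rank Fusion (RRF), one common way to merge a vector-ranked list with a keyword-ranked list. This is an illustrative sketch, not Dify's actual fusion logic; the input rankings are made-up document IDs.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: a document's score is the sum of
    1 / (k + rank) over every ranked list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers.
vector_hits  = ["d3", "d1", "d2"]   # semantic similarity order
keyword_hits = ["d1", "d4", "d3"]   # BM25/keyword order
fused = rrf([vector_hits, keyword_hits])
```

Documents ranked well by both retrievers (like `d1` here) float to the top, which is exactly the behavior the hybrid approach is after; the constant `k` damps the influence of any single list.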

Dify plans to improve RAG pipelines by visualizing each step in the pipeline and providing a simple, easy-to-use interface that helps application builders manage personal or team knowledge bases and quickly integrate them into AI applications. It is also gradually adding support for synchronizing data from various sources, including web pages, Notion, GitHub, and databases. Dify already offers extensive RAG capabilities covering everything from document ingestion to retrieval, with out-of-the-box support for extracting text from common formats such as PDF and PPT. Its hybrid search capability further enhances retrieval by combining the strengths of vector and keyword search [2][3][4][5].

Currently, Dify handles multimodal data in its RAG pipeline by supporting optional ETL (Extract, Transform, Load) solutions to preprocess and clean multi-source data. Specifically, Dify supports two ETL solutions: Dify ETL and Unstructured ETL. The choice of ETL solution depends on the version of Dify being used. Additionally, Dify employs a re-ranking model to further refine the results of the retrieval process, improving the quality and relevance of the retrieved information [6][7].
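The re-ranking step mentioned above can be sketched as follows: retrieval returns a coarse candidate list, then a re-rank model rescores each candidate against the query and reorders them. Here `score_pair` is a toy word-overlap stand-in for a real re-ranking model, not Dify's implementation.

```python
def score_pair(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words found in the passage.
    A real re-ranker would be a learned model scoring the (query, passage) pair."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words)

def rerank(query, candidates, top_n=2):
    # Rescore every retrieved candidate and keep the best top_n.
    return sorted(candidates, key=lambda p: score_pair(query, p), reverse=True)[:top_n]

candidates = [
    "onboarding checklist for new employees",
    "quarterly revenue grew in the chart",
    "revenue chart for the last quarter",
]
best = rerank("revenue chart", candidates)
```

The point of the two-stage design is cost: the cheap first-stage retriever narrows millions of chunks to a handful, and the more expensive re-ranker only scores that handful.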

Dify supports embedding images and text into a single vector database through the PGVector class, which handles the creation and management of vector embeddings for documents. The create and add_texts methods of the PGVector class embed documents (including text content) and store them in the vector database [8][9][6][10].
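To make the shape of that interface concrete, here is a simplified in-memory stand-in. The `create` and `add_texts` method names mirror the interface described above, but the bodies are plain Python, not Dify's PGVector implementation, and the embedding function is a hypothetical stub.

```python
class InMemoryVectorStore:
    """Simplified stand-in for a vector store wrapper like PGVector.
    Storage here is a Python list, not Postgres + pgvector."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: text -> vector
        self.rows = []            # (text, vector) pairs

    @classmethod
    def create(cls, texts, embed_fn):
        """Build a store and embed an initial batch of documents."""
        store = cls(embed_fn)
        store.add_texts(texts)
        return store

    def add_texts(self, texts):
        """Embed each text and persist it alongside its vector."""
        for t in texts:
            self.rows.append((t, self.embed_fn(t)))

# Stub embedding: vector is just the text length. Image content would enter
# this path as text (e.g. a caption) produced upstream in the pipeline.
store = InMemoryVectorStore.create(["alpha", "beta"], embed_fn=lambda t: [float(len(t))])
store.add_texts(["gamma caption of an image"])
```

Under this pattern, whether the pipeline follows approach 1 or 2 is decided before this layer: the store only ever sees text plus a vector, so image handling lives in the encoder or the captioning step upstream.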

I hope this helps! If you have any more questions or need further assistance, feel free to ask.

To continue talking to Dosu, mention @dosu.