langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Expand the management function of the dataset - Narrowing down the search scope of the dataset based on paths and more metadata #10170

Open glacierck opened 2 weeks ago

glacierck commented 2 weeks ago


1. Is this request related to a challenge you're experiencing? Tell me about your story.

We are developing a dynamic knowledge base system that implements document question answering based on RAG (Retrieval-Augmented Generation). To improve retrieval efficiency and accuracy, we need to be able to restrict the scope of vector retrieval at runtime according to a path supplied by the user, rather than blindly searching the entire dataset. The scenario I evaluated has 100,000 documents. Because vector recall only returns the top-n matches, searching the full dataset at that scale leads to performance problems and unreliable accuracy.

Specific Requirements:

Vector Retrieval within a Specified Path Range:

Users should be able to input a specific file path, such as '/dataset-1/dir-1/a.docx'. The system needs to be able to perform vector retrieval within the specified range (i.e., the file or the directory and its subdirectories) based on this path.
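To make the requirement concrete, here is a minimal sketch of what path-scoped retrieval could look like. It is not Dify's implementation: the `Chunk` class, the `in_scope` and `scoped_search` helpers, and the idea of storing the document path as chunk metadata are all assumptions made for illustration, and a real system would push the filter down into the vector store instead of scanning in memory.

```python
import math
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Chunk:
    path: str               # hypothetical: source document path kept as chunk metadata
    text: str
    embedding: list[float]  # hypothetical: precomputed embedding of the chunk


def in_scope(doc_path: str, scope: str) -> bool:
    """Return True if doc_path is the scope itself or lies below it as a directory."""
    doc = PurePosixPath(doc_path)
    root = PurePosixPath(scope)
    # Comparing path components avoids '/dir-1' wrongly matching '/dir-10'.
    return doc == root or root in doc.parents


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(chunks, query_embedding, scope, top_n=3):
    """Filter chunks by path scope first, then rank only the survivors."""
    candidates = [c for c in chunks if in_scope(c.path, scope)]
    candidates.sort(key=lambda c: cosine(c.embedding, query_embedding), reverse=True)
    return candidates[:top_n]


chunks = [
    Chunk("/dataset-1/dir-1/a.docx", "a", [1.0, 0.0]),
    Chunk("/dataset-1/dir-1/sub/b.docx", "b", [0.9, 0.1]),
    Chunk("/dataset-1/dir-10/c.docx", "c", [1.0, 0.0]),
    Chunk("/dataset-2/d.docx", "d", [1.0, 0.0]),
]
hits = scoped_search(chunks, [1.0, 0.0], "/dataset-1/dir-1")
print([c.path for c in hits])
# prints ['/dataset-1/dir-1/a.docx', '/dataset-1/dir-1/sub/b.docx']
```

Filtering before ranking means top-n is computed over the scoped subset only, so documents outside the requested path cannot crowd out relevant results.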

Multi-level Precise Q&A Search:

Desired Outcomes:

By implementing the retrieval function within a specified path range, users can more flexibly control the scope and precision of their searches, thereby obtaining more efficient search results in different scenarios. This function should significantly enhance the Q&A performance and user experience of the dynamic knowledge base system.

2. Additional context or comments

https://github.com/langgenius/dify/pull/5928 : this PR provides a solution, but I think cross-dataset retrieval may make the dataset list harder to manage. The same effect can be achieved by adding a directory dimension to management within a dataset.

3. Can you help us with this feature?

crazywoola commented 2 weeks ago

You can discuss this with @Yawen-1010. She is the PM for RAG.

glacierck commented 1 week ago

@Yawen-1010 Is there a development plan or design blueprint for the dataset? I would like to know what is planned for dataset management and fragment retrieval. Our team is considering fully adopting Dify as our AI agent platform. Currently, the dataset-related functions are in a primitive state. We hope to join you and help move this area forward.

glacierck commented 1 week ago

@Yawen-1010 Anyway, in the workflow, the retrieval of datasets only accepts the input parameter 'query', which is too weak.
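As an illustration of what a richer input could look like, the sketch below pairs the query with metadata filters. The `RetrievalRequest` shape and the `matches` helper are hypothetical and do not reflect the actual schema of Dify's retrieval node.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalRequest:
    # Hypothetical request shape, not Dify's actual retrieval node schema.
    query: str
    filters: dict[str, str] = field(default_factory=dict)  # metadata key -> required value
    top_n: int = 5


def matches(doc_meta: dict[str, str], request: RetrievalRequest) -> bool:
    """A document qualifies only if every requested filter equals its metadata value."""
    return all(doc_meta.get(key) == value for key, value in request.filters.items())


request = RetrievalRequest(query="leave policy", filters={"department": "hr"})
print(matches({"department": "hr", "year": "2024"}, request))   # prints True
print(matches({"department": "finance"}, request))              # prints False
```

With no filters set, every document matches, so the request degrades to the current query-only behaviour.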

ZYW-Mia commented 1 week ago

Hi, @glacierck. Thank you for your request. The goal is clear, the scenario is specific, and the description is detailed. We have received many similar requests, and we are currently designing a metadata feature for the knowledge base to solve this problem. We are happy to discuss this issue further.