langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

The knowledge base paging query is confused #8623

Open lvxinliang opened 3 days ago

lvxinliang commented 3 days ago

Dify version

0.6.16

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

When multiple documents are uploaded at the same time, the paginated document query returns inconsistent results.

✔️ Expected Behavior

Different pages should not contain duplicate records.

❌ Actual Behavior

Different pages return the same records.
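The thread does not say what causes the duplicates, so the following is only a generic illustration of how this symptom can arise, not a diagnosis of Dify. It assumes that documents uploaded together share the same sort key (a hypothetical `created_at` value) and that the backend may return tied rows in a different order on each page request. All names here are made up for the sketch.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    id: str
    created_at: int  # hypothetical upload timestamp; identical for a batch upload


def page(rows, page_no, size, key):
    """Sort the rows by `key`, then return the ids on the requested page."""
    ordered = sorted(rows, key=key)
    start = (page_no - 1) * size
    return [d.id for d in ordered[start:start + size]]


docs = [Doc("a", 100), Doc("b", 100), Doc("c", 100), Doc("d", 100)]

# Simulate two page requests that see the tied rows in different physical order.
scan_1 = docs
scan_2 = list(reversed(docs))


def by_time(d):
    return d.created_at  # every row ties on this key


def by_time_then_id(d):
    return (d.created_at, d.id)  # unique id breaks the ties


ambiguous_p1 = page(scan_1, 1, 2, by_time)
ambiguous_p2 = page(scan_2, 2, 2, by_time)
stable_p1 = page(scan_1, 1, 2, by_time_then_id)
stable_p2 = page(scan_2, 2, 2, by_time_then_id)

print("tied key only:", ambiguous_p1, ambiguous_p2)
print("with tiebreaker:", stable_p1, stable_p2)
```

With the tied key alone, both pages contain the same two ids; adding a unique tiebreaker makes the pages disjoint regardless of scan order.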

dosubot[bot] commented 3 days ago

I found a similar open issue and discussion that might be relevant to your problem:

  1. Can dify support using knowledge base via Annual Report or Annual Document of many different year - This discussion is still open and might provide some insights.
  2. Error in using knowledge base recall test - This issue is closed but was referenced by another issue related to validation errors.

To resolve the issue of different pages having the same result when querying multiple documents in the knowledge base using Dify version 0.6.16, you should review and configure the retrieval strategy appropriately. Specifically, you need to ensure that the correct recall strategy is being used.

Dify offers two recall strategies:

  1. N-to-1 Recall: This mode executes knowledge base queries through function calling and requires the selection of a system reasoning model.
  2. Multi-way Recall: This mode requires configuring a Rerank model for result re-ranking.

Steps to Configure Retrieval Strategy

  1. Create Knowledge Base & Upload Documents:

    • Segment and clean the document, and preview the effect.
    • Choose and configure Index Mode and Retrieval Settings.
    • Wait for the chunks to be embedded.
  2. Text Preprocessing and Cleaning:

    • Use either Automatic or Custom mode for chunking and cleaning the text.
    • Ensure the text is properly segmented and cleaned to improve recall accuracy.
  3. Indexing Mode:

    • Choose the appropriate indexing method (High-Quality, Economical, or Q&A Mode).
    • Configure the retrieval settings based on the indexing method.
  4. Retrieval Settings:

    • In High-Quality indexing mode, choose between Vector Search, Full-Text Search, or Hybrid Search.
    • Configure the Rerank Model, TopK, and Score Threshold settings to optimize retrieval results.
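To make the TopK and Score Threshold settings concrete, here is a minimal sketch of how such parameters typically shape a result list. It is not Dify's implementation: the function name, the input shape, and the choice to filter by threshold before truncating to TopK are all assumptions.

```python
def apply_top_k_and_threshold(scored_chunks, top_k, score_threshold=None):
    """Keep chunks scoring at least `score_threshold`, best first, at most `top_k`."""
    kept = [
        (chunk, score)
        for chunk, score in scored_chunks
        if score_threshold is None or score >= score_threshold
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical retrieval output: (chunk id, relevance score).
results = [("chunk-1", 0.91), ("chunk-2", 0.42), ("chunk-3", 0.77), ("chunk-4", 0.65)]
selected = apply_top_k_and_threshold(results, top_k=2, score_threshold=0.5)
print(selected)
```

A higher threshold trades recall for precision, while TopK caps how many chunks are passed on as context.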

Connecting Knowledge and Setting Retrieval Mode

In applications that utilize multiple knowledge bases, it is essential to configure the retrieval mode to enhance the precision of retrieved content. To set the retrieval mode for the knowledge bases, navigate to Context -- Retrieval Settings -- Rerank Setting.

The retriever scans all knowledge bases linked to the application for text content relevant to the user's question. The results are then consolidated.

[Figure: technical flowchart for the Multi-path Retrieval mode]

This method simultaneously queries all knowledge bases connected in "Context", seeking relevant text chunks across multiple knowledge bases, collecting all content that aligns with the user's question, and ultimately applying the Rerank strategy to identify the most appropriate content to respond to the user. This retrieval approach offers more comprehensive and accurate results by leveraging multiple knowledge bases simultaneously.

For instance, suppose application A is linked to three knowledge bases K1, K2, and K3. When a user sends a question, multiple relevant pieces of content are retrieved from these knowledge bases and combined. The Rerank strategy is then applied to find the content that best relates to the user's query, improving the precision and reliability of the results.

In practical Q&A scenarios, the sources of content and retrieval methods for each knowledge base may differ. To manage the mixed content returned from retrieval, the Rerank strategy acts as a refined sorting mechanism. It ensures that the candidate content aligns well with the user's question, optimizing the ranking of results across multiple knowledge bases to identify the most suitable content, thereby improving answer quality and overall user experience.
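The query-all-then-rerank flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Dify's code: `multi_path_retrieve`, `toy_retrieve`, and `toy_rerank_score` are hypothetical names, and simple word overlap stands in for both the per-knowledge-base retrieval and the Rerank scoring.

```python
def words(text):
    return set(text.lower().split())


def toy_retrieve(question, chunks):
    """Stand-in retriever: return chunks sharing at least one word with the question."""
    return [c for c in chunks if words(question) & words(c)]


def toy_rerank_score(question, chunk):
    """Stand-in rerank score: number of words shared with the question."""
    return len(words(question) & words(chunk))


def multi_path_retrieve(question, knowledge_bases, retrieve, rerank_score, top_k):
    # 1. Query every linked knowledge base and pool the candidates.
    candidates = []
    for name, chunks in knowledge_bases.items():
        for chunk in retrieve(question, chunks):
            candidates.append((name, chunk))
    # 2. Rerank the pooled candidates against the question, best first.
    ranked = sorted(candidates, key=lambda c: rerank_score(question, c[1]), reverse=True)
    # 3. Keep only the most relevant results.
    return ranked[:top_k]


knowledge_bases = {
    "K1": ["reset your password in settings", "billing cycles explained"],
    "K2": ["password rules and length"],
    "K3": ["how to reset a device"],
}
top = multi_path_retrieve(
    "how to reset password", knowledge_bases, toy_retrieve, toy_rerank_score, top_k=2
)
print(top)
```

The point of the second step is that candidates from different knowledge bases become comparable only after they are scored by one common function.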

Considering the costs associated with using Rerank and the needs of the business, the multi-path retrieval mode provides two Rerank settings:

Weighted Score

This setting uses internal scoring mechanisms and does not require an external Rerank model, thus avoiding any additional processing costs. You can select the most appropriate content matching strategy by adjusting the weight ratio sliders for semantics or keywords.
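One plausible reading of the weight ratio slider is a linear blend of a semantic score and a keyword score. The formula below is an assumption made for illustration; the thread does not state how Dify computes the weighted score, and the candidate names and scores are invented.

```python
def weighted_score(semantic, keyword, semantic_weight):
    """Blend two scores; the keyword weight is the remainder of the semantic weight."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    return semantic_weight * semantic + (1.0 - semantic_weight) * keyword


def rank(candidates, semantic_weight):
    """Return candidate names ordered by blended score, best first."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(*candidates[name], semantic_weight),
        reverse=True,
    )


# Hypothetical candidates: name -> (semantic score, keyword score).
candidates = {"A": (0.9, 0.2), "B": (0.4, 0.9)}

semantic_heavy = rank(candidates, semantic_weight=0.7)
keyword_heavy = rank(candidates, semantic_weight=0.3)
print(semantic_heavy, keyword_heavy)
```

Moving the slider toward semantics favors candidate A, while moving it toward keywords favors candidate B, which is the kind of trade-off the setting exposes.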

Rerank Model

The Rerank model is an external scoring system that calculates the relevance score between the user's question and each candidate document provided, improving the results of semantic ranking and returning a list of documents sorted by relevance from high to low.

While this method incurs some additional costs, it is more adept at handling complex knowledge base content, such as content that combines semantic queries and keyword matches, or cases involving multilingual returned content.

Dify currently supports multiple Rerank models. To use external Rerank models, you'll need to provide an API Key. Enter the API Key for the Rerank model (such as Cohere, Jina, etc.) on the "Model Provider" page.

Configuring the Rerank model in the Model Provider

Adjustable Parameters

The multi-path retrieval mode can achieve higher-quality recall when retrieving from multiple knowledge bases, so it is the recommended recall mode.

By ensuring the correct configuration of these settings, you should be able to resolve the issue of different pages having the same result when querying multiple documents [1][2][3][4].

lvxinliang commented 3 days ago

This has been resolved in the new version: https://github.com/langgenius/dify/commit/2fe2e350ce55c85cf63d94fd3f5596bd781f7f17

crazywoola commented 3 days ago

Please close the issue if this is resolved.