langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Recall is extremely inaccurate after updating the knowledge base via the API #8384

Open shanyang131 opened 1 week ago

shanyang131 commented 1 week ago

Dify version

0.8.0

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

1. Our company updates the database by calling an API, once every two hours.
2. After each update, recall becomes extremely inaccurate, as shown in Actual Behavior.
3. Adjusting the index mode and retrieval settings in the configuration sometimes improves it. We have tried both the High Quality and Economical options, and with either one recall sometimes becomes very good. We have tested many situations and found no pattern. Sometimes it improves inexplicably without any adjustment.
4. The document segmentation is shown in the screenshot under Steps to Reproduce. Each paragraph corresponds to an independent collection item.
5. Each input is the name of a collection item, and the only goal is to recall the corresponding collection item.
6. The desired recall is shown in Expected Behavior.
7. I don't know where the problem is, and there are no error logs. It feels like it might be a bug. I hope to receive a reply, either here or by email at shanyangyang7@gmail.com.

Screenshot 2024-09-13 21 02 00

✔️ Expected Behavior

Screenshot 2024-09-13 21 06 21

❌ Actual Behavior

Screenshot 2024-09-13 20 45 37
dosubot[bot] commented 1 week ago

@shanyang131 Hey there! 👋 I'm here to help you with bug fixes, questions, and even becoming a contributor. Let's squash those bugs together!

I found some similar issues and discussions that might be relevant to your problem:

  1. Use the knowledge base API to adding a segment of a document, will not be automatically indexed? (Closed)

    • Solution: The segment added using the knowledge base API is not automatically indexed. You need to explicitly enable the segment for indexing by sending a PATCH request. This will trigger the indexing process.
      curl --location --request PATCH 'http://10.4.127.135/v1/datasets/c8674072-2799-4585-95f0-976746f8e0bb/segments/aa2316d3-233c-45be-a294-8526809f0f74/enable' \
      --header 'Authorization: Bearer dataset-dUyXKXp2TtFsWYo1UdYufpKI' \
      --header 'Content-Type: application/json'

      [1]

  2. The overlapping length of the segments seems to be ineffective. Please help, thank you (Closed)

    • Solution: Ensure that the chunk_overlap value is correctly set and passed in your code. Updating to the latest version of the code should also resolve the problem. [2]
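For anyone scripting the update job, the enable call from item 1 can be sketched in Python. This is a minimal sketch, not verified against the Dify documentation: the endpoint path and headers are copied from the curl command above, while the function name `build_enable_request` and the placeholder host, IDs, and key are hypothetical.

```python
import urllib.request


def build_enable_request(base_url: str, dataset_id: str, segment_id: str,
                         api_key: str) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that enables one segment.

    The path mirrors the curl command quoted in this thread; it is an
    assumption, not a confirmed part of the public API.
    """
    url = (f"{base_url.rstrip('/')}/v1/datasets/{dataset_id}"
           f"/segments/{segment_id}/enable")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, method="PATCH", headers=headers)


# Placeholder values; substitute your own host, IDs, and dataset API key.
req = build_enable_request("http://localhost", "DATASET_ID", "SEGMENT_ID",
                           "dataset-XXXX")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would send it. If the update job adds many segments every two hours, it may be worth looping over every new segment ID and checking each response status.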

To improve the accuracy of recall when updating the knowledge base using the API, consider the following strategies:

  1. Text Preprocessing and Cleaning:

    • Ensure proper chunking and cleaning of text. Use custom chunking rules to define delimiters, maximum chunk length, and chunk overlap.
    • Apply text preprocessing rules to filter out insignificant content, such as replacing consecutive spaces, newlines, and tabs, and deleting all URLs and email addresses.
  2. Indexing Mode:

    • Choose the appropriate indexing mode. High-Quality Mode offers vector retrieval, full-text retrieval, and hybrid retrieval, which can enhance the accuracy of LLM-user interactions.
    • Use Q&A Mode for scenarios involving high-frequency and highly similar queries.
  3. Retrieval Settings:

    • Configure retrieval settings to optimize recall. Enable the Rerank Model to perform semantic reordering of the retrieved document results.
    • Adjust the TopK and Score Threshold settings to filter the most similar text chunks to the user's question.
    • Use Hybrid Search to combine full-text search and vector search, and set custom weights for semantic and keyword priorities.
  4. Retrieval Testing:

    • Utilize the text retrieval testing feature to debug the recall effects under different retrieval methods and parameter configurations. Adjust retrieval methods and parameters during the recall testing process to find the optimal settings.
  5. Citation and Attribution:

    • Enable the citation attribution feature to view specific citation paragraph information below the response content, including original segment text, segment number, and matching degree.
  6. Multi-path Retrieval Mode:

    • Use the Multi-path Retrieval mode to scan all knowledge bases linked to the application for relevant text content and consolidate the results. Configure the Rerank strategy to refine the sorting of retrieved content.
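The cleaning and chunking rules in strategy 1 can be illustrated with a small standalone sketch. It is not Dify's implementation: the regular expressions, the function names `clean_text` and `chunk_text`, and the choice to express overlap in characters are all assumptions made for illustration.

```python
import re

# Simplified patterns; real-world URL and email matching is more involved.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def clean_text(text: str) -> str:
    """Delete URLs and emails, then collapse repeated whitespace."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    text = re.sub(r"[ \t]+", " ", text)   # runs of spaces/tabs -> one space
    text = re.sub(r"\n{2,}", "\n", text)  # runs of newlines -> one newline
    return text.strip()


def chunk_text(text: str, delimiter: str = "\n", max_len: int = 4000,
               overlap: int = 0) -> list[str]:
    """Split on the delimiter, then cut long pieces into overlapping windows."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for piece in text.split(delimiter):
        piece = piece.strip()
        if not piece:
            continue
        for start in range(0, len(piece), step):
            chunks.append(piece[start:start + max_len])
            if start + max_len >= len(piece):
                break  # the window already reached the end of the piece
    return chunks


raw = "Item  A\t\tsee http://x.com/y  or mail a@b.com\n\n\nItem B"
print(chunk_text(clean_text(raw)))
print(chunk_text("abcdefghij", max_len=4, overlap=1))
```

With one collection item per paragraph, as in the report, a newline delimiter and a maximum length larger than any single item should keep each item in its own chunk.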

Here is an example configuration for these settings in your API:

{
  "indexing_mode": "High-Quality",
  "retrieval_settings": {
    "type": "Hybrid",
    "weights": {
      "semantic": 0.7,
      "keyword": 0.3
    },
    "rerank_model": {
      "enabled": true,
      "top_k": 5,
      "score_threshold": 0.6
    }
  },
  "chunking_rules": {
    "delimiter": "\\n",
    "max_chunk_length": 4000,
    "chunk_overlap": 0.2
  },
  "preprocessing_rules": {
    "replace_consecutive_spaces": true,
    "delete_urls": true,
    "delete_emails": true
  }
}
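To make the `weights`, `top_k`, and `score_threshold` fields above concrete, here is a generic weighted-sum sketch of hybrid ranking. It is an illustration only and does not claim to match how Dify combines scores: the function `hybrid_rank` is hypothetical, and both input scores are assumed to be normalized to the range 0 to 1.

```python
def hybrid_rank(candidates, semantic_weight=0.7, keyword_weight=0.3,
                top_k=5, score_threshold=0.6):
    """Combine two scores per segment, drop low scorers, keep the best top_k.

    candidates: iterable of (segment_id, semantic_score, keyword_score).
    """
    scored = []
    for seg_id, semantic, keyword in candidates:
        score = semantic_weight * semantic + keyword_weight * keyword
        if score >= score_threshold:
            scored.append((seg_id, round(score, 4)))
    # Highest combined score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up scores for three segments.
sample = [("a", 0.9, 0.5), ("b", 0.4, 1.0), ("c", 0.8, 0.8)]
print(hybrid_rank(sample))
```

In this sample, segment `b` is an exact keyword match but falls below the threshold because its semantic score is low. When queries are item names, a high score threshold or a heavy semantic weight can hide the exact match in the same way.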

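By carefully configuring these settings and utilizing retrieval testing, citation features, and the Multi-path Retrieval mode, you can improve the accuracy of recall in your knowledge base [3][4][5].

[3] https://github.com/langgenius/dify-docs/blob/main/en/guides/knowledge-base/create-knowledge-and-upload-documents.md
[4] https://github.com/langgenius/dify-docs/blob/main/en/guides/knowledge-base/retrieval-test-and-citation.md
[5] https://github.com/langgenius/dify-docs/blob/main/en/guides/knowledge-base/integrate-knowledge-within-application.md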
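Since the report expects each item name to recall exactly its own segment, retrieval testing can also be automated after every update. The following is a minimal sketch under stated assumptions: `top1_hit_rate` is a hypothetical helper, `retrieve` stands for whatever function returns ranked segment IDs for a query in your setup, and the in-memory index and item names are made-up sample data.

```python
from typing import Callable, Iterable


def top1_hit_rate(cases: Iterable[tuple[str, str]],
                  retrieve: Callable[[str], list[str]]) -> float:
    """Fraction of queries whose first retrieved segment is the expected one."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = 0
    for query, expected_id in cases:
        results = retrieve(query)
        if results and results[0] == expected_id:
            hits += 1
    return hits / len(cases)


# Stand-in retriever backed by a dictionary; replace with a real lookup.
fake_index = {
    "Bronze Mirror": ["seg-1", "seg-2"],
    "Jade Disc": ["seg-9", "seg-3"],
}


def fake_retrieve(query: str) -> list[str]:
    return fake_index.get(query, [])


cases = [("Bronze Mirror", "seg-1"), ("Jade Disc", "seg-3"), ("Unknown", "seg-7")]
rate = top1_hit_rate(cases, fake_retrieve)
print(f"top-1 hit rate: {rate:.2f}")
```

Running such a check right after each two-hourly update, and again some time later, may help show whether the poor recall coincides with segments that have not finished indexing.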

To continue talking to Dosu, mention @dosu.

shanyang131 commented 1 week ago

curl --location --request PATCH 'http://10.4.127.135/v1/datasets/c8674072-2799-4585-95f0-976746f8e0bb/segments/aa2316d3-233c-45be-a294-8526809f0f74/enable' \
--header 'Authorization: Bearer dataset-dUyXKXp2TtFsWYo1UdYufpKI' \
--header 'Content-Type: application/json'

Does this API have documentation? @JohnJyong

shanyang131 commented 1 week ago

curl --location --request PATCH 'http://10.4.127.135/v1/datasets/c8674072-2799-4585-95f0-976746f8e0bb/segments/aa2316d3-233c-45be-a294-8526809f0f74/enable' \
--header 'Authorization: Bearer dataset-dUyXKXp2TtFsWYo1UdYufpKI' \
--header 'Content-Type: application/json'

Hi, does this API have documentation?


dosubot[bot] commented 1 week ago

Yes, this API has documentation. The Dify documentation repository includes comprehensive instructions on how to set up and use the Dify Backend API, including starting the docker-compose stack, setting up the environment, installing dependencies, running migrations, and starting the backend service. Additionally, the Dify Knowledge Base API offers a standard set of APIs for CRUD operations on documents and chunks within the knowledge base. For more detailed information, you can refer to the Knowledge Base API documentation [1][2][3].

This is a tough nut to crack, could really use your help on this, @takatost.
