langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Error in text segmentation and cleaning of dify local knowledge base #6188

Closed. RR0810 closed this issue 1 day ago.

RR0810 commented 1 month ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I tried to segment and clean the text of a document uploaded to a local knowledge base, using custom mode. I set a segment identifier and a maximum segment length of 1000. However, the text was split at points well short of 1000, and one sentence was cut in the middle of a word.

2. Additional context or comments

No response

3. Can you help us with this feature?

dosubot[bot] commented 1 month ago

Hey @RR0810! 👋 I'm here to assist you with bugs, questions, and becoming a contributor. I'm Dosu, a bot designed to help while you wait for a human maintainer.

To resolve the issue of text segments not adhering to the specified 1000-character limit and splitting words when using the custom mode in the Dify application, you need to configure the segmentation rules properly. Specifically, you should set the "Maximum segment length" to 1000 characters and ensure that the "Segmentation identifier" is appropriately set to avoid splitting words.

Here is how you can configure it:

  1. Maximum segment length: Set this to 1000 characters to ensure that no segment exceeds this length.
  2. Segmentation identifier: Set an identifier such as "\n" (newline) to segment the text at appropriate points, which helps in avoiding splitting words.

Example configuration:

segmentation_rules:
  segmentation_identifier: "\n"
  maximum_segment_length: 1000
  segment_overlap_length: 100  # Optional, set overlap to retain semantic relevance
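To make the interaction between the two settings concrete, here is a minimal Python sketch of separator-first segmentation. It is an illustration only, not Dify's actual splitter: the function name `segment_text` and its behavior (cut at the last space when a piece is too long) are assumptions. It does show why segments can be shorter than the maximum: the identifier splits the text first, and the maximum length is only an upper bound on each resulting piece.

```python
def segment_text(text, separator="\n", max_length=1000):
    """Hypothetical sketch: split on the separator, then cap each piece.

    Over-long pieces are cut at the last space within the limit so that
    words stay whole; a hard cut is used only when no space is available.
    """
    segments = []
    for piece in text.split(separator):
        piece = piece.strip()
        while len(piece) > max_length:
            # Look for the last space at or before the length limit.
            cut = piece.rfind(" ", 0, max_length + 1)
            if cut <= 0:
                # No space found: fall back to a hard cut at the limit.
                cut = max_length
            segments.append(piece[:cut].rstrip())
            piece = piece[cut:].lstrip()
        if piece:
            segments.append(piece)
    return segments


if __name__ == "__main__":
    sample = "first paragraph with several words\nsecond paragraph"
    for seg in segment_text(sample, separator="\n", max_length=20):
        print(len(seg), repr(seg))
```

In this sketch a short paragraph becomes its own segment even though it is far below the limit, which matches the behavior described in the issue when the identifier occurs frequently in the text.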

Additionally, you can use the Dify API to create documents with custom segmentation rules. Here is an example of how to create a document by text using the API:

curl --location --request POST 'https://api.dify.ai/v1/datasets/{dataset_id}/document/create_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "Dify",
    "text": "Dify means Do it for you...",
    "indexing_technique": "high_quality",
    "process_rule": {
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": true},
                {"id": "remove_urls_emails", "enabled": true}
            ],
            "segmentation": {
                "separator": "###",
                "max_tokens": 500
            }
        },
        "mode": "custom"
    }
}'
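The same request can be assembled in Python using only the standard library. This is a sketch under assumptions: the helper name `build_create_by_text_request` and its parameters are hypothetical, and the endpoint path and payload fields are copied from the curl example above rather than verified against the current API. The function only builds the request; sending it is left to the caller.

```python
import json
import urllib.request


def build_create_by_text_request(api_base, dataset_id, api_key, name, text,
                                 separator="\n", max_tokens=1000):
    """Build (but do not send) a create_by_text request with custom segmentation."""
    payload = {
        "name": name,
        "text": text,
        "indexing_technique": "high_quality",
        "process_rule": {
            "mode": "custom",
            "rules": {
                "pre_processing_rules": [
                    {"id": "remove_extra_spaces", "enabled": True},
                    {"id": "remove_urls_emails", "enabled": True},
                ],
                # Field names mirror the curl example above.
                "segmentation": {"separator": separator, "max_tokens": max_tokens},
            },
        },
    }
    url = f"{api_base}/datasets/{dataset_id}/document/create_by_text"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace with a real dataset ID and API key.
    req = build_create_by_text_request(
        "https://api.dify.ai/v1", "your-dataset-id", "your-api-key",
        name="Dify", text="Dify means Do it for you...",
    )
    print(req.get_method(), req.full_url)
    # To send: urllib.request.urlopen(req)
```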

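With these rules in place, the system segments the text at the configured separator and caps each segment at the configured maximum length, which should prevent words from being split. Note that the API example sets "max_tokens" to 500 and uses "###" as the separator; set "max_tokens" to 1000 and choose a separator that actually occurs in your text to match the configuration discussed in this issue [1][2][3].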


crazywoola commented 1 month ago

@JohnJyong Please take a look at this