langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Modifying document segmenting in Question & Answer format leads to all data being erased #7977

Closed. dc-vin closed this issue 2 weeks ago.

dc-vin commented 2 months ago


Dify version

0.7.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Modifying document segmenting in Question & Answer format leads to all data being erased.

By the way, have you considered supporting the setting of QA mode when creating a document through an API?

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 2 months ago

I found a similar closed issue regarding the Q&A segmentation mode, where segmentation using the option "simplified Chinese" generated a large number of English segments. The suggested solution was to emphasize the use of language above the prompt and to check out the latest code [1].

Additionally, there is an existing API endpoint for setting the QA mode when creating a document. You can set the process_rule parameter in the request body to include a mode field that can be set to automatic or custom. Here are examples for creating a document from text and from a file:

For text:

curl --location --request POST '{api_base_url}/datasets/{dataset_id}/document/create_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "text",
    "text": "text",
    "indexing_technique": "high_quality",
    "process_rule": {
        "mode": "automatic"
    }
}'

For a file:

curl --location --request POST '{api_base_url}/datasets/{dataset_id}/document/create_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{\"name\":\"Dify\",\"indexing_technique\":\"high_quality\",\"process_rule\":{\"rules\":{\"pre_processing_rules\":[{\"id\":\"remove_extra_spaces\",\"enabled\":true},{\"id\":\"remove_urls_emails\",\"enabled\":true}],\"segmentation\":{\"separator\":\"###\",\"max_tokens\":500}},\"mode\":\"custom\"}}";type=text/plain' \
--form 'file=@"/path/to/file"'

[2]
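For readers who script against the API from Python, the text example above can be sketched as follows. This is a minimal illustration, not an official client: the endpoint path and body fields are copied from the curl command in this comment, while the function name `build_create_by_text_request`, the base URL, the dataset ID, and the API key are placeholders assumed for illustration. Note that `mode` here takes the values `automatic` or `custom` mentioned above; nothing in this thread confirms that it selects the Q&A segmentation mode.

```python
import json
import urllib.request


def build_create_by_text_request(api_base_url, dataset_id, api_key,
                                 name, text, mode="automatic"):
    """Build (but do not send) the POST request shown in the curl example.

    The helper name and its parameters are hypothetical; only the URL path
    and the JSON fields come from the comment above.
    """
    url = f"{api_base_url.rstrip('/')}/datasets/{dataset_id}/document/create_by_text"
    payload = {
        "name": name,
        "text": text,
        "indexing_technique": "high_quality",
        "process_rule": {"mode": mode},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace them with your own deployment's settings.
    req = build_create_by_text_request(
        "https://dify.example.invalid/v1", "my-dataset-id", "my-api-key",
        name="text", text="text",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send the request: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything touches a real knowledge base.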


crazywoola commented 2 months ago

By the way, have you considered supporting the setting of QA mode when creating a document through an API?

I think this is a good idea. Maybe you can open a PR to implement this feature, since we are a little busy these days.

crazywoola commented 2 months ago

This 'erased' behavior is expected, because we let the LLM split the original text into Q&A pairs.

dc-vin commented 2 months ago

This 'erased' behavior is expected, because we let the LLM split the original text into Q&A pairs.

A colleague modified the QA mode a couple of days ago, which resulted in the deletion of long-term maintained data without any backup. This led to criticism from our supervisor. I suggest notifying users that their data will be erased when modifying the mode.
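The kind of warning suggested here could look roughly like the sketch below. It is purely illustrative and is not Dify's code: the `change_segmentation_mode` function, the `ConfirmationRequired` exception, and the dictionary shape of `document` are all hypothetical names assumed for this example. The idea is simply that a mode change which would discard existing segments is refused until the caller explicitly confirms it.

```python
class ConfirmationRequired(Exception):
    """Raised when a destructive change has not been explicitly confirmed."""


def change_segmentation_mode(document, new_mode, confirmed=False):
    """Return a copy of `document` switched to `new_mode`.

    `document` is a hypothetical dict with the keys "mode" and "segments".
    Switching modes discards the existing segments, so the change is
    refused unless `confirmed` is True.
    """
    if new_mode == document["mode"]:
        return document  # nothing changes, nothing is lost
    count = len(document["segments"])
    if count and not confirmed:
        raise ConfirmationRequired(
            f"Switching from {document['mode']!r} to {new_mode!r} will erase "
            f"{count} existing segment(s). Pass confirmed=True to proceed."
        )
    # Segments are regenerated after the switch, so the old ones are dropped.
    return {**document, "mode": new_mode, "segments": []}


if __name__ == "__main__":
    doc = {"mode": "text", "segments": ["segment 1", "segment 2"]}
    try:
        change_segmentation_mode(doc, "qa")
    except ConfirmationRequired as warning:
        print(warning)
    print(change_segmentation_mode(doc, "qa", confirmed=True))
```

In a UI, the same check would correspond to a confirmation dialog shown before the mode is saved.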

dc-vin commented 2 months ago

By the way, have you considered supporting the setting of QA mode when creating a document through an API?

I think this is a good idea. Maybe you can open a PR to implement this feature, since we are a little busy these days.

If I have time this month, I will try to open a PR.

dc-vin commented 2 months ago

This 'erased' behavior is expected, because we let the LLM split the original text into Q&A pairs.

I discovered a new bug. When I set the QA mode and change the language type, it automatically reverts to English after saving.