Modifying document segmenting in Question & Answer format leads to all data being erased

dc-vin commented 2 months ago

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.7.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Changing a document's segmentation setting to Question & Answer format erases all of its existing data. By the way, have you considered supporting the setting of QA mode when creating a document through the API?

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 2 months ago

I found a similar closed issue regarding the Q&A segmentation mode, where segmentation using the option "simplified Chinese" generated a large number of English segments. The suggested solution was to emphasize the use of language above the prompt and to check out the latest code ^[1].

Additionally, there is an existing API endpoint for setting the QA mode when creating a document. You can set the process_rule parameter in the request body to include a mode field that can be set to automatic or custom. Here are examples for creating a document from text and from a file:

For text:

curl --location --request POST '${props.apiBaseUrl}/datasets/{dataset_id}/document/create_by_text' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "text",
    "text": "text",
    "indexing_technique": "high_quality",
    "process_rule": {
        "mode": "automatic"
    }
}'

For a file:

curl --location --request POST '${props.apiBaseUrl}/datasets/{dataset_id}/document/create_by_file' \
--header 'Authorization: Bearer {api_key}' \
--form 'data="{\"name\":\"Dify\",\"indexing_technique\":\"high_quality\",\"process_rule\":{\"rules\":{\"pre_processing_rules\":[{\"id\":\"remove_extra_spaces\",\"enabled\":true},{\"id\":\"remove_urls_emails\",\"enabled\":true}],\"segmentation\":{\"separator\":\"###\",\"max_tokens\":500}},\"mode\":\"custom\"}}";type=text/plain' \
--form 'file=@"/path/to/file"'

^[2]
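As a rough illustration, the text example above could be built from Python as in the sketch below. It uses only the fields shown in the curl commands (`name`, `text`, `indexing_technique`, `process_rule`) and does not set Q&A segmentation, because nothing in these examples shows a parameter for that. The helper name `build_create_by_text_request` and the placeholder base URL, dataset ID, and API key are hypothetical.

```python
import json
import urllib.request


def build_create_by_text_request(api_base_url, dataset_id, api_key,
                                 name, text, mode="automatic", rules=None):
    """Build (but do not send) a create_by_text request.

    Hypothetical helper: it only mirrors the fields shown in the curl
    examples above and has not been checked against a running server.
    """
    # The examples above only show these two values for process_rule.mode.
    if mode not in ("automatic", "custom"):
        raise ValueError("mode must be 'automatic' or 'custom'")

    process_rule = {"mode": mode}
    if rules is not None:
        # e.g. {"segmentation": {"separator": "###", "max_tokens": 500}}
        process_rule["rules"] = rules

    payload = {
        "name": name,
        "text": text,
        "indexing_technique": "high_quality",
        "process_rule": process_rule,
    }
    url = f"{api_base_url.rstrip('/')}/datasets/{dataset_id}/document/create_by_text"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace them with your own deployment details.
    req = build_create_by_text_request(
        "https://dify.example.com/v1", "my-dataset-id", "my-api-key",
        name="text", text="text",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it lets you inspect the JSON body before anything reaches the server.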


crazywoola commented 2 months ago

> By the way, have you considered supporting the setting of QA mode when creating a document through an API?

I think this is a good idea. Maybe you could open a PR to implement this feature, because we are a little busy these days.

crazywoola commented 2 months ago

This 'erased' behavior is expected, because we let the LLM split the original text into Q&A pairs.

dc-vin commented 2 months ago

> This 'erased' behavior is expected, because we let the LLM split the original text into Q&A pairs.

A colleague changed the QA mode setting a couple of days ago, which deleted data we had maintained for a long time and had no backup of. Our supervisor criticized us for it. I suggest warning users that their data will be erased before they change the mode.

dc-vin commented 2 months ago

> > By the way, have you considered supporting the setting of QA mode when creating a document through an API?
>
> I think this is a good idea. Maybe you could open a PR to implement this feature, because we are a little busy these days.

If I have time this month, I will try to open a PR.

dc-vin commented 2 months ago

> This 'erased' behavior is expected, because we let the LLM split the original text into Q&A pairs.

I discovered a new bug: when I set QA mode and change the language, the language automatically reverts to English after saving.

langgenius / dify