langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Full-text entry support for the knowledge base rather than being divided into several paragraphs #7720

Closed. jiaqianjing closed this issue 1 week ago

jiaqianjing commented 2 months ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I have many articles, none of them very long, roughly 1,000 characters or a little more each. However, I would like each article to be retrieved as a whole instead of being split into several fragments, because the articles are all independent of one another. Current model contexts have long supported inputs at the million-character level, so there is no need to limit segments so strictly (no more than 1,000 characters per paragraph). This is very inconvenient and introduces errors, because the retrieved content may be missing parts, or pieces of different articles may be spliced together, which is not acceptable.

2. Additional context or comments

No response

3. Can you help us with this feature?

Sakura4036 commented 2 months ago

I agree

Weaxs commented 2 months ago

I thought you did not want Chinese text segmentation?

If your 1,000 characters are Chinese and you use Qdrant as the vector DB, maybe you can try the following solutions:

  1. Use the official Qdrant image, qdrant/qdrant, not the Dify image langgenius/qdrant. The official Qdrant does not support Chinese text segmentation (see the compose sketch below this list).

  2. Maybe you can update the full-text index after the knowledge base is created in Dify? (not recommended)

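A rough sketch of what swapping to the official image could look like in a self-hosted docker-compose.yaml; the service name, tag, and port here are illustrative assumptions, not taken from the Dify repository:

    # docker-compose.yaml (illustrative; adjust names and tags to your deployment)
    services:
      qdrant:
        # official Qdrant image instead of the Dify build langgenius/qdrant
        image: qdrant/qdrant:latest
        ports:
          - "6333:6333"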

jiaqianjing commented 2 months ago

I thought you did not want Chinese text segmentation?

If your 1,000 characters are Chinese and you use Qdrant as the vector DB, maybe you can try the following solutions:

  1. Use the official Qdrant image, qdrant/qdrant, not the Dify image langgenius/qdrant. The official Qdrant does not support Chinese text segmentation.
  2. Maybe you can update the full-text index after the knowledge base is created in Dify? (not recommended)

First of all, thank you very much for your reply. I really don't want my Chinese documents to be segmented. I want the whole document to be retrieved and used as context, together with the prompt, as input to the LLM. At the moment I think the easiest way to do this would be to relax the 1,000 limit, say to 5,000. I haven't looked at the Dify implementation, so I don't know the reason for this limit.

Weaxs commented 2 months ago


First of all, thank you very much for your reply. I really don't want my Chinese documents to be segmented. I want the whole document to be retrieved and used as context, together with the prompt, as input to the LLM. At the moment I think the easiest way to do this would be to relax the 1,000 limit, say to 5,000. I haven't looked at the Dify implementation, so I don't know the reason for this limit.

Oh, sorry, I misunderstood before.

If you want to change the maximum segment token length for self-hosted Dify, you can set INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH in .env, like:

INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH=5000
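For a typical Docker Compose self-hosted deployment, applying this could look roughly like the sketch below; the docker/ path and the restart step are assumptions about a standard setup, not something confirmed in this thread:

    # edit dify/docker/.env (path assumed for a standard Compose deployment)
    INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH=5000
    # recreate the containers so the api and worker services pick up the new value
    docker compose up -d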

If you use Dify Cloud, hmm... maybe ask @takatost.

Weaxs commented 2 months ago

@jiaqianjing are you using Cloud or self-hosted?

jiaqianjing commented 2 months ago

dify cloud @Weaxs

Weaxs commented 1 month ago

dify cloud @Weaxs

INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH on Dify Cloud has already been raised to 4,000 tokens.

But I guess there may be no plan to raise it to 5,000 tokens...
