Closed: jiaqianjing closed this issue 1 week ago
I agree
I thought you did not want Chinese text segmentation?
If the 1,000 characters are Chinese and you use Qdrant as the vector db, maybe you can try the following solutions:
- Use the official Qdrant image, i.e. qdrant/qdrant, not Dify's langgenius/qdrant. The official Qdrant image does not support Chinese text segmentation (see the compose sketch below).
- Maybe you can update the full-text index yourself after creating the knowledge in Dify? (not recommended)
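If you go the official-image route, the swap is just a small compose override. A minimal sketch, assuming a self-hosted Dify stack where the vector db service is named qdrant; adjust the service name and paths to your own docker-compose file:

```yaml
# docker-compose.override.yml (hypothetical): point the vector db at the official image.
# The service name "qdrant" and the volume path are assumptions; check your own stack.
services:
  qdrant:
    image: qdrant/qdrant:latest         # official image, no Chinese full-text tokenizer
    # image: langgenius/qdrant           # Dify's image, which this tip swaps out
    volumes:
      - ./volumes/qdrant:/qdrant/storage
```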
First of all, thank you very much for your reply. I really don't want my Chinese documents to be segmented. I want the whole document to be retrieved and used as context, along with the prompt, as the input to the LLM. At the moment I think the easiest way to do this would be to relax the 1,000 limit, say to 5,000. I haven't looked at the Dify implementation, so I don't know the reason for this limit.
Oh, sorry, I misunderstood before.
If you want to change the max segment token length for self-hosted Dify, you can set INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH
in .env, like:
INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH=5000
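For reference, this is roughly how such a setting caps segment size during indexing. A minimal sketch only, not Dify's actual code; the splitter, the default value, and the names below are assumptions for illustration:

```python
import os

# Hypothetical illustration: read the limit from the environment and cap segment length.
# Dify's real indexing pipeline is more involved; this only shows the effect of the setting.
# (The real setting counts tokens, not characters; characters are used here for simplicity.)
MAX_SEGMENT_LEN = int(os.getenv("INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH", "1000"))

def split_into_segments(text: str, max_len: int = MAX_SEGMENT_LEN) -> list[str]:
    """Naively cut text into pieces no longer than max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

article = "某篇约一千二百字的中文文章" * 90   # roughly 1,200 characters
# Prints 1 when the variable is set to 5000, but 2 with the assumed default of 1000.
print(len(split_into_segments(article)))
```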
If you use Dify Cloud, hmmmm... maybe ask @takatost.
@jiaqianjing, do you use the cloud or self-host?
dify cloud @Weaxs
INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH
in Dify Cloud has already been raised to 4000 tokens,
but there is probably no plan to raise it to 5000 tokens, I guess...
Self Checks
1. Is this request related to a challenge you're experiencing? Tell me about your story.
I have many articles, and none of them are very long, only a bit over 1,000 characters each. However, I hope each one can be retrieved as a whole instead of being split into several fragments, because the articles are independent of each other (independently and identically distributed). Model contexts have long supported inputs at the million-character level, so there is no need to limit segments this strictly (no more than 1,000 characters each). It is very inconvenient and introduces errors: the retrieved content may have parts missing, or pieces of different articles may be spliced together, which is not acceptable.
2. Additional context or comments
No response
3. Can you help us with this feature?