langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Full-text entry support for the knowledge base rather than being divided into several paragraphs #7720

Closed. jiaqianjing closed this issue 1 week ago

jiaqianjing commented 2 months ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I have many articles, none of them very long, roughly 1,000 characters or a little more each. However, I would like each article to be retrieved as a whole instead of being split into several fragments, because the articles are all independent of one another. Current model contexts have long supported inputs at the million-character level, so there is no need to limit segments so strictly (no more than 1,000 characters per paragraph). This is very inconvenient and introduces errors, because the retrieved content may be missing parts, or pieces of different articles may be spliced together, which is not acceptable.

2. Additional context or comments

No response

3. Can you help us with this feature?

Sakura4036 commented 2 months ago

I agree

Weaxs commented 2 months ago

I thought you did not want Chinese text segmentation?

If your 1,000 characters are Chinese and you use Qdrant as the vector DB, maybe you can try the following solutions:

  1. Use the official Qdrant image, qdrant/qdrant, not the Dify image langgenius/qdrant. The official Qdrant does not support Chinese text segmentation (see the compose sketch below this list).

  2. Maybe you can update the full-text index after the knowledge base is created in Dify? (not recommended)

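A rough sketch of what swapping to the official image could look like in a self-hosted docker-compose.yaml; the service name, tag, and port here are illustrative assumptions, not taken from the Dify repository:

    # docker-compose.yaml (illustrative; adjust names and tags to your deployment)
    services:
      qdrant:
        # official Qdrant image instead of the Dify build langgenius/qdrant
        image: qdrant/qdrant:latest
        ports:
          - "6333:6333"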

jiaqianjing commented 2 months ago

I thought you did not want Chinese text segmentation?

If your 1,000 characters are Chinese and you use Qdrant as the vector DB, maybe you can try the following solutions:

  1. Use the official Qdrant image, qdrant/qdrant, not the Dify image langgenius/qdrant. The official Qdrant does not support Chinese text segmentation.
  2. Maybe you can update the full-text index after the knowledge base is created in Dify? (not recommended)

First of all, thank you very much for your reply. I really don't want my Chinese documents to be segmented. I want the whole document to be retrieved and used as context, together with the prompt, as input to the LLM. At the moment I think the easiest way to do this would be to relax the 1,000 limit, say to 5,000. I haven't looked at the Dify implementation, so I don't know the reason for this limit.

Weaxs commented 2 months ago


First of all, thank you very much for your reply. I really don't want my Chinese documents to be segmented. I want the whole document to be retrieved and used as context, together with the prompt, as input to the LLM. At the moment I think the easiest way to do this would be to relax the 1,000 limit, say to 5,000. I haven't looked at the Dify implementation, so I don't know the reason for this limit.

Oh, sorry, I misunderstood before.

If you want to change the maximum segment token length for self-hosted Dify, you can set INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH in .env, like:

INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH=5000
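For a typical Docker Compose self-hosted deployment, applying this could look roughly like the sketch below; the docker/ path and the restart step are assumptions about a standard setup, not something confirmed in this thread:

    # edit dify/docker/.env (path assumed for a standard Compose deployment)
    INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH=5000
    # recreate the containers so the api and worker services pick up the new value
    docker compose up -d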

If you use Dify Cloud, hmm... maybe ask @takatost.

Weaxs commented 2 months ago

@jiaqianjing are you using Cloud or self-hosted?

jiaqianjing commented 2 months ago

dify cloud @Weaxs

Weaxs commented 1 month ago

dify cloud @Weaxs

INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH on Dify Cloud has already been raised to 4,000 tokens.

But I guess there may be no plan to raise it to 5,000 tokens...
