langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

The overlapping length of the segments seems to be ineffective. Please help, thank you #2555

Closed lhr1998 closed 5 months ago

lhr1998 commented 6 months ago

Provide a description of the help you need

I set a segment overlap length when creating the knowledge base, but it doesn't seem to have much effect. Neither the segment preview nor the paragraphs retrieved during use contain any overlapping context from the neighboring segments. (Screenshot attached.)

dosubot[bot] commented 6 months ago

🤖

Based on the information you've provided, the issue might be related to how the segment overlap length (the chunk_overlap value) is passed to the create_segment_to_index_task method.

The create_segment_to_index_task method creates and indexes a segment from the content and metadata of the document, but it does not appear to use the chunk_overlap value. This could be why the segment overlap length specified when creating the knowledge base has no effect.

The chunk_overlap value is used in the FixedRecursiveCharacterTextSplitter or EnhanceRecursiveCharacterTextSplitter instance during the text extraction and transformation process in the indexing runner. This is where the document is split into smaller segments based on the specified chunk_size and chunk_overlap.

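```python
# Excerpt from the indexing runner; segmentation, separator and
# embedding_model_instance are defined by the surrounding code.
character_splitter = FixedRecursiveCharacterTextSplitter.from_encoder(
    chunk_size=segmentation["max_tokens"],
    chunk_overlap=segmentation.get("chunk_overlap", 0),
    fixed_separator=separator,
    separators=["\n\n", "。", ".", " ", ""],
    embedding_model_instance=embedding_model_instance
)
```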

In this code, chunk_size is the maximum number of tokens in each segment, and chunk_overlap is the number of tokens that consecutive segments should overlap. This overlap can help ensure that no important information is lost at the boundaries between segments.
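To make the expected effect of the overlap concrete, here is a minimal sketch of a sliding-window splitter. It is not Dify's implementation: the function name `split_with_overlap` is hypothetical, tokens are simply whitespace-separated words, and the real recursive splitter also splits on separators before merging pieces. It only illustrates what `chunk_size` and `chunk_overlap` should produce when the overlap is actually applied.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Each window starts chunk_size - chunk_overlap tokens after the previous
    one, so it repeats the last chunk_overlap tokens of a full previous window.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: whitespace "tokens" stand in for model tokens (an assumption).
tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

With an overlap of 2, every segment begins with the last two tokens of the previous one. If the segment preview shows no such repeated text at the boundaries, the overlap value is probably not reaching the splitter.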

Unfortunately, I couldn't find how the chunk_overlap value is being passed to the create_segment_to_index_task method in the repository. It's possible that this information is available elsewhere or I might have missed it.

As for the screenshot you've shared, I'm sorry but I'm unable to view images. Could you please provide more details about what the screenshot shows? This could help in understanding the issue better.

If you could provide more information about how you're setting the chunk_overlap value and how it's being used in your code, it would be helpful in diagnosing the issue.
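One thing worth checking, going only by the excerpt above: the overlap is read with `segmentation.get('chunk_overlap', 0)`, so a missing key silently becomes an overlap of 0. The sketch below mirrors that lookup. The helper name `read_overlap` and the dictionary values are hypothetical and are not taken from the repository.

```python
def read_overlap(segmentation):
    """Mirror the lookup in the excerpt: a missing key means no overlap."""
    return segmentation.get("chunk_overlap", 0)


# Hypothetical segmentation rules, for illustration only.
with_overlap = {"max_tokens": 500, "chunk_overlap": 50}
without_overlap = {"max_tokens": 500}  # chunk_overlap never made it into the rules

print(read_overlap(with_overlap))     # 50
print(read_overlap(without_overlap))  # 0
```

If the rules that reach the splitter look like the second dictionary, segments would be produced without any overlap and no error would be raised.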

JohnJyong commented 5 months ago

This issue has been fixed in the latest code. Thanks for your feedback, @lhr1998.