The overlapping length of the segments seems to be ineffective. Please help, thank you

lhr1998 commented 6 months ago

Self Checks

[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] Please do not modify this template :) and fill in all the required fields.

Provide a description of the help you need

I set a segment overlap length when creating the knowledge base, but it doesn't seem to have much effect. Neither the segment preview nor the paragraphs retrieved during use contain the overlapping context from adjacent segments.

dosubot[bot] commented 6 months ago

🤖

Based on the information you've provided, the issue may be related to how the segment overlap length (the chunk_overlap value) reaches the create_segment_to_index_task method.

The create_segment_to_index_task method creates and indexes a segment from the document's content and metadata, but it does not appear to use the chunk_overlap value. This could be why the segment overlap length specified when creating the knowledge base has no visible effect.

The chunk_overlap value is used in the FixedRecursiveCharacterTextSplitter or EnhanceRecursiveCharacterTextSplitter instance during the text extraction and transformation process in the indexing runner. This is where the document is split into smaller segments based on the specified chunk_size and chunk_overlap.

```python
character_splitter = FixedRecursiveCharacterTextSplitter.from_encoder(
    chunk_size=segmentation["max_tokens"],
    chunk_overlap=segmentation.get("chunk_overlap", 0),  # defaults to no overlap
    fixed_separator=separator,
    separators=["\n\n", "。", ".", " ", ""],
    embedding_model_instance=embedding_model_instance,
)
```

In this code, chunk_size is the maximum number of tokens in each segment, and chunk_overlap is the number of tokens that consecutive segments should overlap. This overlap can help ensure that no important information is lost at the boundaries between segments.
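
To make the expected effect of chunk_overlap concrete, here is a minimal sketch of a fixed-size sliding window over a token list. It is only an illustration and not Dify's splitter: the function name `split_with_overlap` is hypothetical, tokens are approximated by whitespace-separated words, and the real FixedRecursiveCharacterTextSplitter also honors separators, so its boundaries will differ.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of at most chunk_size items.

    Each window after the first starts with the last chunk_overlap
    items of the previous window. Hypothetical helper for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

With a non-zero overlap, the first tokens of each segment should repeat the last tokens of the previous one. If the segment preview shows no such repetition, the overlap setting is probably not reaching the splitter.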

Unfortunately, I couldn't find how the chunk_overlap value is being passed to the create_segment_to_index_task method in the repository. It's possible that this information is available elsewhere or I might have missed it.

As for the screenshot you've shared, I'm sorry but I'm unable to view images. Could you please provide more details about what the screenshot shows? This could help in understanding the issue better.

If you could provide more information about how you're setting the chunk_overlap value and how it's being used in your code, it would be helpful in diagnosing the issue.
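
One way to gather that information is to copy consecutive segments from the preview and check whether they share any text at the boundary. The sketch below uses a hypothetical helper, `shared_boundary`, and made-up sample segments. It assumes the overlap appears as an exact character match, which may not hold if the splitter trims whitespace.

```python
def shared_boundary(prev_segment: str, next_segment: str) -> str:
    """Return the longest suffix of prev_segment that is also a prefix of next_segment."""
    max_len = min(len(prev_segment), len(next_segment))
    # Try the longest candidate first so the first match is the longest one.
    for size in range(max_len, 0, -1):
        if prev_segment[-size:] == next_segment[:size]:
            return prev_segment[-size:]
    return ""


# Sample segments for illustration only; paste real preview segments here.
segments = ["alpha beta gamma", "gamma delta", "epsilon"]
overlaps = [shared_boundary(a, b) for a, b in zip(segments, segments[1:])]
for index, overlap in enumerate(overlaps):
    print(f"segments {index} -> {index + 1}: {overlap!r}")
```

An empty string for every pair suggests that no overlap was applied when the document was segmented.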

JohnJyong commented 5 months ago

This issue has been fixed in the latest code. Thanks for your feedback, @lhr1998.

langgenius / dify