langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Support Contextual Retrieval #8776

Open Weaxs opened 2 months ago

Weaxs commented 2 months ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

https://www.anthropic.com/news/contextual-retrieval

<document> 
{{WHOLE_DOCUMENT}} 
</document> 
Here is the chunk we want to situate within the whole document 
<chunk> 
{{CHUNK_CONTENT}} 
</chunk> 
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else. 
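The prompt above can be assembled per chunk with a simple template. The sketch below is only an illustration of that assembly; `build_context_prompt` is a hypothetical helper, not a Dify API.

```python
# Hypothetical helper that fills the contextual-retrieval prompt template;
# the LLM's reply is the succinct context to prepend to the chunk.
CONTEXT_PROMPT = (
    "<document>\n{whole_document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document\n"
    "<chunk>\n{chunk_content}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within "
    "the overall document for the purposes of improving search retrieval "
    "of the chunk. Answer only with the succinct context and nothing else."
)

def build_context_prompt(whole_document: str, chunk_content: str) -> str:
    """Fill the template for one chunk of the document."""
    return CONTEXT_PROMPT.format(
        whole_document=whole_document, chunk_content=chunk_content
    )
```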


2. Additional context or comments

There are two problems:

  1. We should support changing the LLM used to explain the chunks; perhaps the system LLM can be used initially.
  2. If the document content is larger than the maximum context size the LLM supports, maybe we should not explain/summarize automatically?
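Problem 2 above could be handled with a simple guard that skips automatic summarization when the whole document would not fit in the model's context window. This is a minimal sketch under that assumption; `count_tokens` is a placeholder for whatever tokenizer the chosen model provides.

```python
from typing import Callable

def should_summarize(
    document: str,
    max_context_tokens: int,
    count_tokens: Callable[[str], int],  # placeholder tokenizer
) -> bool:
    """Skip automatic contextual summarization when the whole document
    exceeds the model's context window."""
    return count_tokens(document) <= max_context_tokens
```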

3. Can you help us with this feature?

tobegit3hub commented 1 month ago

What is the current implementation progress of contextual retrieval? @Weaxs

We are really interested in this feature and would like to help implement it.

FreshLucas-git commented 1 month ago

@Weaxs Hi. Is there any update on this feature?

Weaxs commented 1 month ago

> @Weaxs Hi. Is there any update on this feature?

Sorry, I have not started on this feature yet.

I will try to figure it out before Nov. and submit a PR, with review maybe in Dec. I'll try as soon as possible. 🥺

Weaxs commented 3 weeks ago

user chooses contextual retrieval

  1. the user enables [contextual-retrieval] in front-end step two
  2. save the rule in DatasetProcessRule (table: dataset_process_rules)
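One way to persist the new option is an extra flag inside the JSON rules already stored in dataset_process_rules. The shape below is purely illustrative; the actual Dify schema and field names may differ.

```python
# Hypothetical processing rule with a proposed contextual-retrieval flag;
# field names are assumptions, not Dify's actual schema.
process_rule = {
    "mode": "custom",
    "rules": {
        "segmentation": {"separator": "\n", "max_tokens": 500},
        "contextual_retrieval": {"enabled": True},  # new flag proposed here
    },
}
```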


document upload and processing

  1. add a ContextualRecursiveCharacterTextSplitter as the contextual-retrieval text splitter (call _text_splitter_instance.split_text)
  2. split the text via _text_splitter_instance into chunks of roughly 50~100 tokens
  3. assemble a prompt from the whole document and the chunk, then call summarize_model_instance to generate the contextual message (this consumes system LLM tokens)
  4. join the summary with the chunk to form the new chunk
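The steps above can be sketched end to end as one function. This is a minimal illustration, not the proposed implementation: `split_text` stands in for the new splitter's `split_text` method and `summarize` wraps the summarize-model call.

```python
from typing import Callable

def contextual_split(
    document: str,
    split_text: Callable[[str], list[str]],  # stand-in for the splitter
    summarize: Callable[[str], str],         # stand-in for the summarize LLM call
) -> list[str]:
    """Split the document, summarize each chunk against the whole document,
    and join summary + chunk into the new chunk that gets indexed."""
    new_chunks = []
    for chunk in split_text(document):
        prompt = (
            f"<document>\n{document}\n</document>\n"
            f"<chunk>\n{chunk}\n</chunk>\n"
            "Give a short succinct context for this chunk."
        )
        context = summarize(prompt)           # consumes system LLM tokens
        new_chunks.append(f"{context}\n\n{chunk}")
    return new_chunks
```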


other problems