langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Support Contextual Retrieval #8776

Open Weaxs opened 2 months ago

Weaxs commented 2 months ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

https://www.anthropic.com/news/contextual-retrieval

<document> 
{{WHOLE_DOCUMENT}} 
</document> 
Here is the chunk we want to situate within the whole document 
<chunk> 
{{CHUNK_CONTENT}} 
</chunk> 
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else. 
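The prompt above can be assembled per chunk with a simple template. The sketch below is only an illustration of that assembly; `build_context_prompt` is a hypothetical helper, not a Dify API.

```python
# Hypothetical helper that fills the contextual-retrieval prompt template;
# the LLM's reply is the succinct context to prepend to the chunk.
CONTEXT_PROMPT = (
    "<document>\n{whole_document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document\n"
    "<chunk>\n{chunk_content}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within "
    "the overall document for the purposes of improving search retrieval "
    "of the chunk. Answer only with the succinct context and nothing else."
)

def build_context_prompt(whole_document: str, chunk_content: str) -> str:
    """Fill the template for one chunk of the document."""
    return CONTEXT_PROMPT.format(
        whole_document=whole_document, chunk_content=chunk_content
    )
```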


2. Additional context or comments

There are two problems:

  1. We should support changing the LLM used to explain the chunks; perhaps the system LLM can be used initially.
  2. If the document content is larger than the maximum context size the LLM supports, maybe we should not explain/summarize automatically?
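Problem 2 above could be handled with a simple guard that skips automatic summarization when the whole document would not fit in the model's context window. This is a minimal sketch under that assumption; `count_tokens` is a placeholder for whatever tokenizer the chosen model provides.

```python
from typing import Callable

def should_summarize(
    document: str,
    max_context_tokens: int,
    count_tokens: Callable[[str], int],  # placeholder tokenizer
) -> bool:
    """Skip automatic contextual summarization when the whole document
    exceeds the model's context window."""
    return count_tokens(document) <= max_context_tokens
```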

3. Can you help us with this feature?

tobegit3hub commented 1 month ago

What is the current implementation progress of contextual retrieval? @Weaxs

We are really interested in this feature and would like to help implement it.

FreshLucas-git commented 1 month ago

@Weaxs Hi. Is there any update on this feature?

Weaxs commented 1 month ago

> @Weaxs Hi. Is there any update on this feature?

Sorry, I have not started on this feature yet.

I will try to figure it out before Nov. and submit a PR, with review maybe in Dec. I'll try as soon as possible. 🥺

Weaxs commented 3 weeks ago

user chooses contextual retrieval

  1. the user enables [contextual-retrieval] in front-end step two
  2. save the rule in DatasetProcessRule (table: dataset_process_rules)
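One way to persist the new option is an extra flag inside the JSON rules already stored in dataset_process_rules. The shape below is purely illustrative; the actual Dify schema and field names may differ.

```python
# Hypothetical processing rule with a proposed contextual-retrieval flag;
# field names are assumptions, not Dify's actual schema.
process_rule = {
    "mode": "custom",
    "rules": {
        "segmentation": {"separator": "\n", "max_tokens": 500},
        "contextual_retrieval": {"enabled": True},  # new flag proposed here
    },
}
```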


document upload and processing

  1. add a ContextualRecursiveCharacterTextSplitter as the contextual-retrieval text splitter (call _text_splitter_instance.split_text)
  2. split the text via _text_splitter_instance into chunks of roughly 50~100 tokens
  3. assemble a prompt from the whole document and the chunk, then call summarize_model_instance to generate the contextual message (this consumes system LLM tokens)
  4. join the summary with the chunk to form the new chunk
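The steps above can be sketched end to end as one function. This is a minimal illustration, not the proposed implementation: `split_text` stands in for the new splitter's `split_text` method and `summarize` wraps the summarize-model call.

```python
from typing import Callable

def contextual_split(
    document: str,
    split_text: Callable[[str], list[str]],  # stand-in for the splitter
    summarize: Callable[[str], str],         # stand-in for the summarize LLM call
) -> list[str]:
    """Split the document, summarize each chunk against the whole document,
    and join summary + chunk into the new chunk that gets indexed."""
    new_chunks = []
    for chunk in split_text(document):
        prompt = (
            f"<document>\n{document}\n</document>\n"
            f"<chunk>\n{chunk}\n</chunk>\n"
            "Give a short succinct context for this chunk."
        )
        context = summarize(prompt)           # consumes system LLM tokens
        new_chunks.append(f"{context}\n\n{chunk}")
    return new_chunks
```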


other problems