rmusser01 / tldw

tl/dw (Too Long, Didn't Watch): Your Personal Research Multi-Tool - a naive attempt at 'A Young Lady's Illustrated Primer'
Apache License 2.0

Evaluation: Chunking Strategy #186

Closed rmusser01 closed 1 month ago

rmusser01 commented 2 months ago

Issue to track chunking-related research for RAG.

Chunking strategy currently implemented (as of 9/26/24): chunk size is user-decided, 100-9000(?), and chunks can be split by words, sentences, paragraphs, semantic boundaries, or tokens. Overlap size is also user-decidable, 1-8000. Defaults are 500/200. Contextual embedding is used: chunk headers contain the document title and a simple summary of that chunk's placement in the larger document. The prompt used for generation is taken from: https://www.anthropic.com/news/contextual-retrieval
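For reference while comparing strategies, here is a minimal sketch of the word-based variant with overlap and a contextual header. It is not the tldw implementation: `Chunk`, `chunk_words`, and the header format are hypothetical names made up for illustration, the defaults assume that "500/200" means chunk size 500 and overlap 200, and the header only records the chunk's position rather than an LLM-generated summary.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    header: str  # document title plus a note on where the chunk sits
    text: str    # the chunk body


def chunk_words(text: str, title: str, size: int = 500, overlap: int = 200) -> list[Chunk]:
    """Split text into word-based chunks that overlap by `overlap` words.

    Assumption: defaults of 500/200 are read as size/overlap.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    words = text.split()
    step = size - overlap  # how far the window advances each time

    # Collect (start, end) word offsets first so the total count is known
    # before the headers are written.
    spans = []
    start = 0
    while start < len(words):
        end = min(start + size, len(words))
        spans.append((start, end))
        if end == len(words):
            break  # the window reached the end; avoid a redundant tail chunk
        start += step

    chunks = []
    for i, (start, end) in enumerate(spans, 1):
        # Placeholder for the contextual header: a positional note stands in
        # for the LLM-written summary of the chunk's placement.
        header = f"Document: {title} | chunk {i} of {len(spans)} | words {start}-{end}"
        chunks.append(Chunk(header=header, text=" ".join(words[start:end])))
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(1000))
    for chunk in chunk_words(sample, "Example transcript"):
        print(chunk.header)
```

Swapping the positional note for a model-written summary would only change how `header` is built; the windowing logic stays the same, which makes it easy to evaluate the header's effect separately from size and overlap.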

Articles: 101

Eval

Cheaper implementation than using an LLM: https://www.reddit.com/r/Rag/comments/1f0q2b7/rethinking_markdown_splitting_for_rag_context/