Issue to track chunking-related research for RAG.
Chunking strategy currently implemented (9/26/24):
- Chunk size: user-configurable, 100-9000(?) units.
- Units: words, sentences, paragraphs, semantic chunks, or tokens.
- Overlap size: also user-configurable, 1-8000.
- Defaults: 500 chunk size / 200 overlap.
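A minimal sketch of the size/overlap behavior above, using the word unit as an example. The function name and structure are hypothetical; it only illustrates how a 500-unit chunk with 200-unit overlap advances through the text.

```python
def chunk_words(text, chunk_size=500, overlap=200):
    """Split text into overlapping word-based chunks.

    Illustrative sketch only: defaults mirror the 500/200 settings,
    and 'words' stands in for the other unit types (sentences,
    paragraphs, semantic chunks, tokens).
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap  # each chunk starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```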
Contextual embedding is used: chunk headers contain the document title and a short summary of that chunk's placement in the larger document. The prompt used for generating the summary is taken from:
https://www.anthropic.com/news/contextual-retrieval
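A sketch of how the contextual header could be assembled once the LLM has produced the placement summary. The function name and header layout here are assumptions, not the actual implementation; the LLM call itself is omitted.

```python
def build_contextualized_chunk(title, chunk, placement_summary):
    """Prepend a contextual header to a chunk before embedding.

    Hypothetical sketch: the header carries the document title and the
    LLM-generated summary of where this chunk sits in the larger
    document, so the embedding captures context the raw chunk lacks.
    """
    header = f"Document: {title}\nContext: {placement_summary}\n\n"
    return header + chunk
```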
Articles: 101
Eval
A cheaper implementation than using an LLM to generate chunk context: https://www.reddit.com/r/Rag/comments/1f0q2b7/rethinking_markdown_splitting_for_rag_context/
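The linked thread's idea can be sketched as deriving chunk context from the markdown heading trail instead of an LLM call. This function is an assumption about that approach, not code from the thread: it keeps the most recent heading at each level preceding a chunk and joins them into a breadcrumb.

```python
import re

def markdown_heading_context(lines_before_chunk):
    """Build a breadcrumb ("Guide > Setup > Install") from the headings
    preceding a chunk, as a zero-cost stand-in for an LLM summary.

    Hypothetical sketch: keeps the latest heading seen at each level and
    drops deeper headings whenever a shallower one appears.
    """
    trail = {}
    for line in lines_before_chunk:
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if not m:
            continue
        level = len(m.group(1))
        trail[level] = m.group(2).strip()
        # a new heading invalidates any previously seen deeper headings
        for deeper in [lvl for lvl in trail if lvl > level]:
            del trail[deeper]
    return " > ".join(trail[lvl] for lvl in sorted(trail))
```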