tmc / langchaingo

LangChain for Go, the easiest way to write LLM-based programs in Go
https://tmc.github.io/langchaingo/
MIT License
3.73k stars, 518 forks

textsplitter: add WithHeadingHierarchy option to markdown splitter to retain heading hierarchy in chunks #898

Closed: iwilltry42 closed this 1 week ago

iwilltry42 commented 1 week ago

This PR adds a WithHeadingHierarchy option to the markdown textsplitter. When enabled, the splitter prepends the full heading hierarchy to each chunk of a Markdown document instead of only the current heading. This makes a chunk more likely to be retrieved when a query matches one of its higher-level headings.

Example

Input:

# h1

foobar

## h2

bazbom

#### h4

spam eggs

... will yield the last chunk as

# h1
## h2
#### h4

spam eggs

... so that a semantic search by either h1, h2 or h4 will likely return this chunk.
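
For readers who want to see the idea in isolation, here is a minimal standalone sketch of the heading-hierarchy technique using only the Go standard library. It is not the code from this PR and does not use the langchaingo API: the names `splitWithHierarchy`, `headingLevel` and `heading` are hypothetical, and it assumes one chunk per heading section, ATX-style headings only, and no special handling of fenced code blocks or chunk size limits.

```go
package main

import (
	"fmt"
	"strings"
)

// heading is one entry in the stack of currently open headings.
// (Hypothetical type for this sketch, not part of langchaingo.)
type heading struct {
	level int
	line  string
}

// headingLevel returns the ATX heading level (1-6) of line,
// or 0 if the line is not treated as a heading.
func headingLevel(line string) int {
	n := 0
	for n < len(line) && line[n] == '#' {
		n++
	}
	if n == 0 || n > 6 || n >= len(line) || line[n] != ' ' {
		return 0
	}
	return n
}

// splitWithHierarchy emits one chunk per heading section and prepends
// every ancestor heading that is still open at that point.
func splitWithHierarchy(doc string) []string {
	var stack []heading
	var body, chunks []string

	// flush turns the collected body lines into a chunk, if any.
	flush := func() {
		text := strings.TrimSpace(strings.Join(body, "\n"))
		body = body[:0]
		if text == "" {
			return
		}
		var sb strings.Builder
		for _, h := range stack {
			sb.WriteString(h.line + "\n")
		}
		if len(stack) > 0 {
			sb.WriteString("\n")
		}
		sb.WriteString(text)
		chunks = append(chunks, sb.String())
	}

	for _, line := range strings.Split(doc, "\n") {
		lvl := headingLevel(line)
		if lvl == 0 {
			body = append(body, line)
			continue
		}
		flush()
		// Drop headings at the same or a deeper level, then open the new one.
		for len(stack) > 0 && stack[len(stack)-1].level >= lvl {
			stack = stack[:len(stack)-1]
		}
		stack = append(stack, heading{level: lvl, line: line})
	}
	flush()
	return chunks
}

func main() {
	doc := "# h1\n\nfoobar\n\n## h2\n\nbazbom\n\n#### h4\n\nspam eggs\n"
	chunks := splitWithHierarchy(doc)
	// Prints the last chunk, which matches the example above.
	fmt.Println(chunks[len(chunks)-1])
}
```

The stack is what keeps the hierarchy correct: a later `## h2` would pop the `#### h4` and the earlier `## h2` entries, so a chunk never carries headings from a sibling section.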

PR Checklist