gptscript-ai / knowledge

Knowledge for GPTScript
https://gptscript-ai.github.io/knowledge/
Apache License 2.0

add: basic markdown splitter with limited options #36

Closed iwilltry42 closed 4 months ago

iwilltry42 commented 4 months ago

This PR adds a basic markdown text splitter that splits on headings only. You can choose the maximum heading level to split on. Resulting chunks that are larger than the defined chunkSize are split further by a secondary splitter (the default is a recursiveCharacterSplitter). Every chunk is prefixed with its full markdown heading hierarchy, which should improve semantic search results. Optionally, chunks that consist of headings only (i.e. no content) can be dropped.
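
To make the splitting behavior concrete, here is a minimal sketch of the idea and not the code from this PR. The names `splitByHeadings` and `headingLevel` and their parameters are hypothetical. It assumes ATX-style headings (`#` to `######`), skips fenced code blocks, and leaves out the secondary splitter for oversized chunks.

```go
package main

import (
	"fmt"
	"strings"
)

// headingLevel returns 1-6 for an ATX heading line ("# Title"), or 0 otherwise.
func headingLevel(line string) int {
	n := 0
	for n < len(line) && line[n] == '#' {
		n++
	}
	if n == 0 || n > 6 || n >= len(line) || line[n] != ' ' {
		return 0
	}
	return n
}

// splitByHeadings (hypothetical) splits md at headings up to maxLevel and
// prefixes every chunk with the heading lines that lead to it.
func splitByHeadings(md string, maxLevel int, dropHeadingOnly bool) []string {
	var chunks []string
	var stack []string // stack[i] is the active heading line at level i+1 ("" if skipped)
	var body []string  // content lines collected since the last split
	inFence := false

	flush := func() {
		content := strings.TrimSpace(strings.Join(body, "\n"))
		body = nil
		var parts []string
		for _, h := range stack {
			if h != "" {
				parts = append(parts, h)
			}
		}
		// Skip empty chunks, and heading-only chunks when requested.
		if content == "" && (dropHeadingOnly || len(parts) == 0) {
			return
		}
		if content != "" {
			parts = append(parts, content)
		}
		chunks = append(chunks, strings.Join(parts, "\n"))
	}

	for _, line := range strings.Split(md, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), "```") {
			inFence = !inFence // do not treat '#' inside code fences as headings
		}
		level := 0
		if !inFence {
			level = headingLevel(line)
		}
		if level == 0 || level > maxLevel {
			body = append(body, line) // deeper headings stay part of the content
			continue
		}
		flush()
		// Drop headings at the same or a deeper level, then push the new one.
		if len(stack) >= level {
			stack = stack[:level-1]
		}
		for len(stack) < level-1 {
			stack = append(stack, "")
		}
		stack = append(stack, line)
	}
	flush()
	return chunks
}

func main() {
	md := "# A\nintro\n## B\ntext\n## C\n# D\nmore"
	for i, c := range splitByHeadings(md, 2, true) {
		fmt.Printf("--- chunk %d ---\n%s\n", i, c)
	}
}
```

In this sketch the heading `## C` has no content of its own, so it is only emitted as a chunk when `dropHeadingOnly` is false.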

The code follows the functional options pattern, like the golc and langchaingo libraries, for consistency.
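
For readers unfamiliar with the functional options pattern, a minimal sketch follows. Every identifier and default value here (`splitterOptions`, `Option`, `WithChunkSize`, `WithMaxHeadingLevel`, `WithIgnoreHeadingOnly`, `newOptions`) is hypothetical and chosen for illustration; none of them is taken from this PR, golc, or langchaingo.

```go
package main

import "fmt"

// splitterOptions holds the settings of a hypothetical markdown splitter.
type splitterOptions struct {
	chunkSize         int
	maxHeadingLevel   int
	ignoreHeadingOnly bool
}

// Option mutates the options struct; each With* function returns one.
type Option func(*splitterOptions)

// WithChunkSize sets the maximum chunk size (hypothetical option).
func WithChunkSize(n int) Option {
	return func(o *splitterOptions) { o.chunkSize = n }
}

// WithMaxHeadingLevel sets the deepest heading level to split on (hypothetical option).
func WithMaxHeadingLevel(level int) Option {
	return func(o *splitterOptions) { o.maxHeadingLevel = level }
}

// WithIgnoreHeadingOnly toggles dropping chunks without content (hypothetical option).
func WithIgnoreHeadingOnly(ignore bool) Option {
	return func(o *splitterOptions) { o.ignoreHeadingOnly = ignore }
}

// newOptions starts from assumed defaults and applies the given options in order.
func newOptions(opts ...Option) splitterOptions {
	o := splitterOptions{chunkSize: 512, maxHeadingLevel: 3}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := newOptions(WithChunkSize(1024), WithIgnoreHeadingOnly(true))
	fmt.Printf("%+v\n", o)
}
```

Callers only name the settings they want to change, and new options can be added later without breaking existing call sites.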

iwilltry42 commented 4 months ago

actually this is not important enough to steal your time for reviews :)