[Feature Request]: Split at prepared delimiter instead of token splitting.

microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system

https://microsoft.github.io/graphrag/

MIT License

[Feature Request]: Split at prepared delimiter instead of token splitting. #733

Open RobertHH-IS opened 1 month ago

RobertHH-IS commented 1 month ago

Is your feature request related to a problem? Please describe.

A big part of good RAG is the quality of the input data. I would like to prepare chunks of text and metadata specifically for graph extraction. A simple "delim" splitter would be a great addition, as opposed to the far more arbitrary character or token chunker.

Describe the solution you'd like

Allow us to specify a delim in the chunks section of settings.yaml. If it is specified, no token chunking is done; the text is simply split at the delim and processing proceeds from there.
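
To make the request concrete, here is a minimal Python sketch of the splitting behavior described above. The function name `split_on_delimiter` and the `<<<CHUNK>>>` marker are hypothetical and chosen only for illustration; neither is part of GraphRAG's current configuration or API.

```python
def split_on_delimiter(text: str, delim: str) -> list[str]:
    """Split text at every occurrence of delim and drop empty chunks.

    Hypothetical helper: illustrates the requested behavior only.
    """
    if not delim:
        raise ValueError("delim must be a non-empty string")
    # Strip surrounding whitespace so a delimiter on its own line
    # does not leave stray newlines at the chunk boundaries.
    return [chunk.strip() for chunk in text.split(delim) if chunk.strip()]


# Example input: two prepared chunks, each carrying its own metadata line.
prepared = "title: A\nbody of A\n<<<CHUNK>>>\ntitle: B\nbody of B\n<<<CHUNK>>>\n"
chunks = split_on_delimiter(prepared, "<<<CHUNK>>>")
```

Each element of `chunks` would then be passed to graph extraction as-is, with no further character or token chunking.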

Additional context

No response

natoverse commented 1 month ago

Does this address your issue? As long as each document you define stays under the chunk size, GraphRAG will avoid splitting it. https://github.com/microsoft/graphrag/issues/396#issuecomment-2249127128

RobertHH-IS commented 1 month ago

A bit hacky, but it's a way to proceed. Thanks! A delim option would avoid the need for 600,000 files, though :-)
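
For anyone following the same workaround, a rough Python sketch of the one-file-per-chunk approach is below. Everything in it is an assumption made for illustration: the helper name `write_chunks_as_files`, the file naming scheme, and the use of a whitespace word count as a crude stand-in for the token count that the chunk size setting actually measures.

```python
from pathlib import Path


def write_chunks_as_files(
    source: Path, out_dir: Path, delim: str, max_words: int
) -> list[Path]:
    """Write each delimited chunk to its own .txt file.

    Returns the files whose word count exceeds max_words. Word count is
    only a rough proxy for tokens, so treat the result as a warning list.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    pieces = [p.strip() for p in source.read_text(encoding="utf-8").split(delim)]
    oversized = []
    for index, piece in enumerate(p for p in pieces if p):
        target = out_dir / f"chunk_{index:06d}.txt"  # hypothetical naming scheme
        target.write_text(piece, encoding="utf-8")
        if len(piece.split()) > max_words:
            oversized.append(target)
    return oversized
```

Any file returned in the list may still be split by the token chunker, so it would need to be shortened or the chunk size raised.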

natoverse commented 1 month ago

Great, thanks - I'll queue this up, but good to hear you have a path forward in the meantime.