nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

Provide new chunking strategies in localdocs #2635

Open manyoso opened 2 months ago

manyoso commented 2 months ago

Currently we do a character/word based chunking that is very simple. We should enhance our chunking strategies to possibly include:

Here is some possible literature:

ThiloteE commented 2 months ago

Semantic Chunking in practice: https://boudhayan-dev.medium.com/semantic-chunking-in-practice-23a8bc33d56d
Basic RAG vs Advanced RAG: https://medium.com/llamaindex-blog/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b

I also think a very natural long-term goal for GPT4All could be agent-based responses that use knowledge graphs fed in via RAG using Nomic Maps (but that goes beyond a simple "chunking strategy").

kalle07 commented 1 month ago

Is it at least possible to easily change the embedding model? (I don't know; English and Chinese seem OK, but maybe 5% of users are German.) https://huggingface.co/aari1995/German_Semantic_V3

ThiloteE commented 3 weeks ago

Somebody went all in on RegEx lol

Jina AI: Based. Semantic chunking is overrated. Especially when you write a super regex that leverages all possible boundary cues and heuristics to segment text accurately without the need for complex language models. Just think about the speed and the hosting cost. This 50-line, 2490-character regex is as powerful as it can be within the limitations of regex.

Source: https://x.com/JinaAI_/status/1823756993108304135
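For illustration only, here is a minimal Python sketch of the regex idea. It is not Jina AI's regex, which is far larger; the pattern below only looks at paragraph breaks and sentence-ending punctuation. The names `BOUNDARY`, `regex_chunks`, and `max_chars` are hypothetical and not part of GPT4All.

```python
import re

# Hypothetical, heavily simplified boundary pattern: a blank line
# (paragraph break) or whitespace that follows ., ! or ?.
BOUNDARY = re.compile(r"\n\s*\n|(?<=[.!?])\s+")

def regex_chunks(text, max_chars=200):
    """Split text at boundary cues, then greedily pack pieces into chunks."""
    pieces = [p.strip() for p in BOUNDARY.split(text) if p and p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > max_chars:
            # Adding this piece would exceed the limit: close the chunk.
            chunks.append(current)
            current = piece
        else:
            # A single piece longer than max_chars is kept whole.
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Chunks never break inside a sentence here, which is the main thing a plain character or word split cannot guarantee.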

manyoso commented 3 weeks ago

This assumes the text is even structured properly. The PDF text we get right now does not really have much formatting to regex on.
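When the extracted text has no usable boundaries, a structure-agnostic split is the natural fallback. The following Python sketch shows a fixed-size word window with overlap, in the spirit of the simple word-based chunking described at the top of this issue. It is an assumption for illustration, not the actual LocalDocs code; `word_chunks`, `size`, and `overlap` are hypothetical names.

```python
def word_chunks(text, size=100, overlap=20):
    """Split text into windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()  # whitespace split; ignores any formatting
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps some shared context between neighbouring chunks, which partly compensates for cuts that land mid-sentence.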