Open manyoso opened 2 months ago
Semantic Chunking in practice: https://boudhayan-dev.medium.com/semantic-chunking-in-practice-23a8bc33d56d
Basic RAG vs Advanced RAG: https://medium.com/llamaindex-blog/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b
I also think a very natural long-term goal for GPT4All could be agent-based responses that use knowledge graphs fed in via RAG using Nomic Maps (but that goes beyond a simple "chunking strategy").
Is it at least possible to easily change the embedding model? (I don't know for sure; English and Chinese seem OK, but maybe 5% of users are German.) https://huggingface.co/aari1995/German_Semantic_V3
Somebody went all in on RegEx lol
Jina AI Based. Semantic chunking is overrated. Especially when you write a super regex that leverages all possible boundary cues and heuristics to segment text accurately without the need for complex language models. Just think about the speed and the hosting cost. This 50-line, 2490-character regex is as powerful as it can be within the limitations of regex. Source: https://x.com/JinaAI_/status/1823756993108304135
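For anyone wondering what boundary-cue chunking looks like in code, here is a minimal Python sketch. It is not the regex from the tweet, just a much smaller stand-in: the `BOUNDARY` pattern, the `regex_chunks` function, and the `max_chars` parameter are hypothetical names made up for illustration, and the cues (blank lines, Markdown headings, bullets, sentence ends) are assumptions about the input.

```python
import re

# Hypothetical, deliberately small boundary pattern; NOT the regex from the tweet.
BOUNDARY = re.compile(
    r"""
      \n\s*\n                    # blank line: paragraph break
    | \n(?=\#{1,6}\s)            # newline right before a Markdown heading
    | \n(?=[-*]\s)               # newline right before a bullet item
    | (?<=[.!?])\s+(?=[A-Z])     # whitespace after sentence-ending punctuation
    """,
    re.VERBOSE,
)


def regex_chunks(text, max_chars=200):
    """Split text at boundary cues, then greedily pack segments into chunks.

    A single segment longer than max_chars is kept whole rather than cut.
    """
    segments = [s.strip() for s in BOUNDARY.split(text) if s and s.strip()]
    chunks, current = [], ""
    for seg in segments:
        candidate = f"{current} {seg}".strip()
        if current and len(candidate) > max_chars:
            # Adding this segment would overflow: close the current chunk.
            chunks.append(current)
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Calling `regex_chunks(document_text, max_chars=500)` returns a list of strings that each end on one of the cues above, which only helps if the input actually contains those cues.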
This assumes the text is even structured properly. The PDF text we get right now has very little formatting for a regex to key on.
Currently we do very simple character/word-based chunking. We should enhance our chunking strategies to possibly include:
Here is some possible literature: