Open codefromthecrypt opened 2 weeks ago
I have quite a few hypotheses for RAG (smarter chunking, using a more relevant embedding LLM, investigating similarity calculations, ...) but I need to find the time.
And I also need to figure out why my Elastic example is no longer working (Kibana does not start correctly). 😢
Elastic sample issue fixed 🎉
So, the first issue is probably how we chunk the document.
There is a chunking function with overlap: chunks := content.ChunkText(documentationContent, 500, 100)
but I did some tests, and it's not a solution because we lose the document's semantics.
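To make the problem concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not parakeet's ChunkText; chunkWithOverlap is a hypothetical helper, and I am assuming the two numeric arguments mean "chunk size" and "overlap", both counted in runes. The window moves without looking at headings, which is how a section can end up cut in the middle.

```go
package main

import "fmt"

// chunkWithOverlap is a hypothetical helper (not parakeet's ChunkText).
// It slides a window of `size` runes over the text and advances by
// size-overlap runes, so consecutive chunks share `overlap` runes.
func chunkWithOverlap(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // invalid parameters: the window would never advance
	}
	runes := []rune(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window reached the end of the text
		}
	}
	return chunks
}

func main() {
	text := "## Tools\n\nThe trace tool now better tolerates partially broken traces."
	for i, c := range chunkWithOverlap(text, 30, 10) {
		// Chunks are cut by position only, so a heading and its body can be separated.
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```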
I have different possible solutions:
SplitMarkdownByLevelSections(content string, level int)
First, I will work on the first solution. At the same time, I will investigate some NLP techniques like the Jaccard similarity coefficient and the Levenshtein distance.
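For reference, a minimal sketch of the two measures mentioned above, using only the standard library. The function names are illustrative and not part of parakeet. Jaccard here compares sets of lowercase, whitespace-separated words; Levenshtein counts single-rune edits.

```go
package main

import (
	"fmt"
	"strings"
)

// wordSet returns the set of lowercase, whitespace-separated words in s.
func wordSet(s string) map[string]bool {
	set := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(s)) {
		set[w] = true
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B| over the word sets of a and b.
func jaccard(a, b string) float64 {
	setA, setB := wordSet(a), wordSet(b)
	if len(setA) == 0 && len(setB) == 0 {
		return 1 // convention chosen for this sketch: two empty sets are identical
	}
	inter := 0
	for w := range setA {
		if setB[w] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the minimum number of single-rune insertions,
// deletions and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(jaccard("the trace tool", "the trace data"))
	fmt.Println(levenshtein("kitten", "sitting"))
}
```

Both measures work on surface text only, so they would complement embedding similarity (for example to re-rank or de-duplicate retrieved chunks) rather than replace it.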
I added: content.ParseMarkdown
https://github.com/parakeet-nest/parakeet/blob/main/content/advanced_chunkers.go
The result seems better: https://github.com/parakeet-nest/parakeet/tree/main/examples/39-rag-with-elastic-markdown
Good!
q: so [Brief], is that something that specifically works well in llama3.1?
I'm happy with the results of this; with or without Elastic they are indeed better. I assume the key is the chunking, and I'm not sure whether the model matters. I tried qwen2, but it seemed to just repeat the sections as-is.
So, my understanding is:
RAG works best when the embedded data preserves semantic context. Many documents have a hierarchy of sections. When splitting a document into chunks, split on section boundaries and make the section name explicit. If a section is larger than the chunk size, overlap some text so that a chunk is recognizable as a continuation. These approaches help the embeddings represent the text more accurately.
What I noticed in the chunking is that for each chunk, the prompt only includes the section title (e.g. ## crypto/x509\n\n), but not the hierarchy (e.g. ## Go 1.23 Release Notes crypto/x509), and somehow it works ;) If I remove the title prefix it is much less effective. So, the prefixing helps a lot, though I'm not sure how it knows it is in the hierarchy of Go 1.23, except that most sections also say that.
p.s. I notice all the sections fit in one chunk each, so maybe my thoughts about overlap aren't proven here.
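A minimal sketch of the "split on section boundaries and keep the title" idea, for illustration only. splitByHeading is a hypothetical function and not parakeet's ParseMarkdown or SplitMarkdownByLevelSections; it assumes ATX-style headings (lines starting with #) and does not special-case headings that appear inside fenced code blocks.

```go
package main

import (
	"fmt"
	"strings"
)

// splitByHeading is a hypothetical sketch. It starts a new chunk at every
// heading of exactly `level` and keeps the heading line at the top of the
// chunk, so each chunk carries its own section title.
// Text before the first matching heading becomes its own chunk.
func splitByHeading(markdown string, level int) []string {
	prefix := strings.Repeat("#", level) + " "
	var chunks []string
	var current []string
	flush := func() {
		text := strings.TrimSpace(strings.Join(current, "\n"))
		if text != "" {
			chunks = append(chunks, text)
		}
		current = nil
	}
	for _, line := range strings.Split(markdown, "\n") {
		// "### x" does not match the prefix "## ", so deeper headings
		// stay inside the chunk of their parent section.
		if strings.HasPrefix(line, prefix) {
			flush()
		}
		current = append(current, line)
	}
	flush()
	return chunks
}

func main() {
	doc := "# Notes\n\nintro\n\n## A\n\ntext a\n\n### A1\n\nsub\n\n## B\n\ntext b"
	for i, c := range splitByHeading(doc, 2) {
		fmt.Printf("--- chunk %d ---\n%s\n", i, c)
	}
}
```

A section that is still larger than the chunk size could then be cut further with an overlapping window, which is the only place the overlap would be needed.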
@codefromthecrypt, thanks for the remarks and questions 🙏
About [Brief]: it's a "meta prompt". It works for a lot of models (but not for the "small qwen2", which is not "disciplined" enough for that 😉).
By the way, other meta prompts exist; I wrote some helpers for that: https://github.com/parakeet-nest/parakeet/blob/main/prompt/meta.go.
I have the same understanding as you, but I'm not fond of the overlap technique. I think we should use it only with a document that has little or no structure.
To my mind, every chunk should be built carefully: a single chunk has to be understandable on its own (it must carry meaning by itself), but a chunk must not be too big (especially with a small LLM). The tricky part is to "keep the link" when several chunks are related (perhaps by adding metadata like keywords to every related chunk). I need to study this topic more (I plan to read this: https://www.manning.com/books/knowledge-graph-enhanced-rag).
Regarding content.ParseMarkdown, I'm not totally sure why it's better (I had a long chat with ChatGPT to try to understand how to parse a markdown document while keeping the semantics of the document, and made a lot of attempts).
Your idea to keep the hierarchy is pretty good 👍 (I will see if I can do something with this)
thanks for all the insight, research and code. We can keep this open or you can close it whenever you like
Keep it open 😄
btw I used another embedding model: embeddingsModel := "mxbai-embed-large"
@codefromthecrypt I wrote a new function to chunk the markdown content (ParseMarkdownWithHierarchy):
It produces a []Chunk, where each Chunk is:
type Chunk struct {
	Header       string
	Content      string
	Level        int
	Prefix       string
	ParentLevel  int
	ParentHeader string
	ParentPrefix string
}
Then, I can keep a "link" between a section and its parent section (I think I can improve it more by adding all the child sections)
Then, with this, you can prepare the content for the embeddings like this (for example):
### Trace {#trace}
<!-- Parent Section: ## Tools {#tools} -->
<!-- go.dev/issue/65316 -->
The `trace` tool now better tolerates partially broken traces by attempting to
recover what trace data it can. This functionality is particularly helpful when
viewing a trace that was collected during a program crash, since the trace data
leading up to the crash will now [be recoverable](/issue/65319) under most
circumstances.
You can test it here: https://github.com/parakeet-nest/parakeet/tree/main/examples/40-rag-with-elastic-markdown
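A sketch of how the text for the embeddings could be assembled from one chunk, matching the layout of the example above. The Chunk type below is a local copy of the fields listed in this thread, and the meaning I give them is an assumption: Prefix holds the heading marker (e.g. "###") and Header the heading text without it. embeddingText is a hypothetical helper, not a parakeet function.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk mirrors the fields listed in this thread. It is a local copy for
// illustration, not an import of parakeet's type.
type Chunk struct {
	Header       string
	Content      string
	Level        int
	Prefix       string
	ParentLevel  int
	ParentHeader string
	ParentPrefix string
}

// embeddingText is a hypothetical helper. It writes the chunk's own heading,
// then an HTML comment naming the parent section (if any), then the content.
func embeddingText(c Chunk) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s %s\n", c.Prefix, c.Header)
	if c.ParentHeader != "" {
		fmt.Fprintf(&b, "<!-- Parent Section: %s %s -->\n", c.ParentPrefix, c.ParentHeader)
	}
	b.WriteString(c.Content)
	return b.String()
}

func main() {
	c := Chunk{
		Header:       "Trace {#trace}",
		Content:      "The `trace` tool now better tolerates partially broken traces.",
		Level:        3,
		Prefix:       "###",
		ParentLevel:  2,
		ParentHeader: "Tools {#tools}",
		ParentPrefix: "##",
	}
	fmt.Println(embeddingText(c))
}
```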
Ok, I added a new one, ParseMarkdownWithLineage:
type Chunk struct {
	Header       string
	Content      string
	Level        int
	Prefix       string
	ParentLevel  int
	ParentHeader string
	ParentPrefix string
}
Then you can add more "links":
#### [`path/filepath`](/pkg/path/filepath/)
<!-- Parent Section: ### Minor changes to the library {#minor_library_changes} -->
<!-- Lineage: Standard library {#library} > Minor changes to the library {#minor_library_changes} > [`path/filepath`](/pkg/path/filepath/) -->
The new [`Localize`](/pkg/path/filepath#Localize) function safely converts a slash-separated
path into an operating system path.
On Windows, [`EvalSymlinks`](/pkg/path/filepath#EvalSymlinks) no longer evaluates mount points,
which was a source of many inconsistencies and bugs.
This behavior is controlled by the `winsymlink` setting.
For Go 1.23, it defaults to `winsymlink=1`.
Previous versions default to `winsymlink=0`.
On Windows, [`EvalSymlinks`](/pkg/path/filepath#EvalSymlinks) no longer tries to normalize
volumes to drive letters, which was not always even possible.
This behavior is controlled by the `winreadlinkvolume` setting.
For Go 1.23, it defaults to `winreadlinkvolume=1`.
Previous versions default to `winreadlinkvolume=0`.
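One way such a lineage could be computed is with a stack of open headings. This is a sketch under my own assumptions, not the code behind ParseMarkdownWithLineage: headingLineages is a hypothetical function, it only recognizes ATX headings, it ignores fenced code blocks, and it includes every ancestor level it finds (the example above starts its lineage at the ## level).

```go
package main

import (
	"fmt"
	"strings"
)

// headingLineages is a hypothetical sketch. For each heading in the document
// it returns the " > "-joined path of its ancestor headings, ending with the
// heading itself.
func headingLineages(markdown string) []string {
	type heading struct {
		level int
		title string
	}
	var stack []heading // headings currently "open", outermost first
	var lineages []string
	for _, line := range strings.Split(markdown, "\n") {
		// Count leading '#' characters to get the heading level.
		level := 0
		for level < len(line) && line[level] == '#' {
			level++
		}
		// Not a heading: no '#', more than 6, or no space after the markers.
		if level == 0 || level > 6 || level >= len(line) || line[level] != ' ' {
			continue
		}
		title := strings.TrimSpace(line[level:])
		// Close every open heading at the same or a deeper level.
		for len(stack) > 0 && stack[len(stack)-1].level >= level {
			stack = stack[:len(stack)-1]
		}
		stack = append(stack, heading{level, title})
		parts := make([]string, len(stack))
		for i, h := range stack {
			parts[i] = h.title
		}
		lineages = append(lineages, strings.Join(parts, " > "))
	}
	return lineages
}

func main() {
	doc := "## Standard library\n\n### Minor changes to the library\n\n#### path/filepath\n\nbody\n\n## Ports\n"
	for _, l := range headingLineages(doc) {
		fmt.Println(l)
	}
}
```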
thanks, having a look quickly before flight! p.s. on 40 I have an error unrelated to the RAG approach:
$ echo $USER
adriancole
$ docker compose up -d
[+] Running 4/6
✔ Network better-rag_default Created 0.1s
✔ Volume "better-rag_kibanadata" Created 0.0s
✔ Volume "better-rag_esdata" Created 0.0s
⠹ Container better-rag-es01-1 Starting 0.2s
⠹ Container better-rag-setup-1 Starting 0.2s
✔ Container better-rag-kibana-1 Created 0.0s
Error response from daemon: error while creating mount source path '/Users/adriancole/oss/parakeet/examples/40-rag-with-elastic-markdown/certs': chown /Users/adriancole/oss/parakeet/examples/40-rag-with-elastic-markdown/certs: permission denied
fwiw I ran again and somehow it didn't mind 🤷
I've tried both HTML and markdown and I'm not sure how I can get a correct answer. Possibly it is how I am setting up my prompt, or the question itself. That said, possibly the input isn't clean enough. I've read that in some cases chunking needs an overlap, but I'm not sure that's really the issue. Can you have a look?
I was hoping to use parakeet to teach me about changes to Go ;) Either of the following
Here's an example diff