Provide an example of externally sourced content

codefromthecrypt commented 2 weeks ago

I've tried both HTML and markdown and I'm not sure how to get a correct answer. Possibly it's how I am setting up my prompt, or the question itself. That said, the input may not be clean enough. I've read that when chunking we want an overlap, but I'm not sure that's really the issue. Can you have a look?

I was hoping to use parakeet to teach me about changes to Go ;) Either of the following

Here's an example diff

--- a/examples/36-rag-with-asciidoc/create-embeddings/main.go
+++ b/examples/36-rag-with-asciidoc/create-embeddings/main.go
@@ -21,12 +21,13 @@ func main() {
                log.Fatalln("😡:", err)
        }

-       rulesContent, err := content.ReadTextFile("./chronicles.adoc")
+       //https://raw.githubusercontent.com/golang/website/master/_content/doc/go1.23.md
+       rulesContent, err := content.ReadTextFile("./go1.23.md")
        if err != nil {
                log.Fatalln("😡:", err)
        }

-       chunks := content.SplitAsciiDocBySections(rulesContent)
+       chunks := content.SplitMarkdownBySections(rulesContent)

        // Create embeddings from documents and save them in the store
        for idx, doc := range chunks {
diff --git a/examples/36-rag-with-asciidoc/use-embeddings/main.go b/examples/36-rag-with-asciidoc/use-embeddings/main.go
index edc1ca3..64fc426 100644
--- a/examples/36-rag-with-asciidoc/use-embeddings/main.go
+++ b/examples/36-rag-with-asciidoc/use-embeddings/main.go
@@ -21,13 +21,10 @@ func main() {
                log.Fatalln("😡:", err)
        }

-       systemContent := `You are the dungeon master,
-       expert at interpreting and answering questions based on provided sources.
-       Using only the provided context, answer the user's question 
-       to the best of your ability using only the resources provided. 
-       Be verbose!`
+       systemContent := `Using only the provided context, answer the user's question 
+       to the best of your ability using only the resources provided.`

-       userContent := `Who are the monsters of Chronicles of Aethelgard?`
+       userContent := `What changes to the archive/tar library happened in Go 1.23`
        //userContent := `Tell me more about Keegorg`

        // Create an embedding from the question

k33g commented 2 weeks ago

I have quite a few hypotheses for RAG (smarter chunking, using a more relevant embedding LLM, investigating similarity calculations, ...) but I need to find the time.

And I also need to figure out why my Elastic example is no longer working (Kibana does not start correctly). 😢

k33g commented 2 weeks ago

Elastic sample issue fixed 🎉

k33g commented 2 weeks ago

I did a test with ES, and the result of the similarity search is not better
I did the same test with a "bigger" LLM; it's not better (even worse 🤔 )

So, the first issue is probably how we chunk the document.

There is a chunking function with overlap, chunks := content.ChunkText(documentationContent, 500, 100), but I ran some tests and it's not a solution, because we lose the document's semantics.
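To make the overlap idea concrete, here is a minimal sketch of fixed-size chunking with overlap. The helper chunkWithOverlap is hypothetical and is not parakeet's ChunkText; it only assumes the two numbers mean "window size" and "shared characters", which may differ from the library.

```go
package main

import "fmt"

// chunkWithOverlap is a hypothetical helper (not parakeet's ChunkText).
// It cuts text into windows of at most size runes. Each window starts
// size-overlap runes after the previous one, so neighbours share up to
// overlap runes.
func chunkWithOverlap(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // invalid parameters: the window would never advance
	}
	runes := []rune(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window already reached the end of the text
		}
	}
	return chunks
}

func main() {
	for i, c := range chunkWithOverlap("abcdefghij", 4, 2) {
		fmt.Printf("%d: %q\n", i, c)
	}
}
```

The windows are cut by position only, so a window can start in the middle of one section and end in the next, which is one way the document's semantics get lost.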

I have different possible solutions:

1. Manually add metadata to the document (keywords, ...) + try other embedding LLMs
2. Allow choosing the level of the section you want to split on (I added a method SplitMarkdownByLevelSections(content string, level int))
3. Add metadata automatically when creating a chunk (like the title of the +1 level section, to keep a link)

First, I will work on the first solution. At the same time, I will investigate some NLP techniques like the Jaccard similarity coefficient and the Levenshtein distance.
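For reference, both measures are small enough to sketch directly. This is an illustration only: the function names are made up, the Jaccard version works on lowercase word sets (an assumption, character n-grams are another common choice), and nothing here is part of parakeet.

```go
package main

import (
	"fmt"
	"strings"
)

// wordSet lowercases s and returns the set of its whitespace-separated words.
func wordSet(s string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(s)) {
		set[w] = true
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B| over the word sets of a and b.
func jaccard(a, b string) float64 {
	setA, setB := wordSet(a), wordSet(b)
	if len(setA) == 0 && len(setB) == 0 {
		return 1 // two empty sets are treated as identical
	}
	inter := 0
	for w := range setA {
		if setB[w] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the minimum number of single-rune insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(jaccard("archive tar changes", "changes to archive tar"))
	fmt.Println(levenshtein("kitten", "sitting"))
}
```

Jaccard compares which words two texts share, while Levenshtein compares spelling, so they could complement an embedding-based similarity rather than replace it.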

k33g commented 2 weeks ago

I added: content.ParseMarkdown https://github.com/parakeet-nest/parakeet/blob/main/content/advanced_chunkers.go

The result seems better: https://github.com/parakeet-nest/parakeet/tree/main/examples/39-rag-with-elastic-markdown

codefromthecrypt commented 2 weeks ago

Good!

q: so [Brief] is that something that specifically works well in llama3.1?

I'm happy with the results of this, indeed with or without elastic they are better. I assume the key is the chunking, and not sure if model was required or not. I tried qwen2, but it seemed to just retort the sections as-is.

So, my understanding is:

RAG is best when the embedded data preserves semantic context. In many documents, there is a hierarchy of sections. When splitting a document into chunks, do so on section boundaries, and make the section name obvious. If a section is larger than the chunk size, take care to overlap some text so it is clear the chunk is a continuation. These are examples of approaches that help the embeddings represent the text more accurately.

What I noticed in the chunking is that, for each chunk, the prompt only includes the section title (e.g. ## crypto/x509\n\n), but not the hierarchy (e.g. ## Go 1.23 Release Notes crypto/x509), and somehow it works ;) If I remove the title prefix it is much less effective. So the prefixing helps a lot, though I'm not sure how it knows it is in the hierarchy of Go 1.23, except that most sections also say that.
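As a rough illustration of "split on section boundaries and keep the title as a prefix", here is a minimal sketch. splitByHeadingLevel is a hypothetical helper, not parakeet's SplitMarkdownByLevelSections or ParseMarkdown, and it makes simplifying assumptions: only ATX headings (lines starting with #) count, text before the first matching heading is dropped, and headings inside code fences are not special-cased.

```go
package main

import (
	"fmt"
	"strings"
)

// splitByHeadingLevel is a hypothetical helper. It starts a new chunk at
// every heading with exactly `level` leading '#' characters and keeps that
// heading as the first line of the chunk, so the title is embedded together
// with the section text. Deeper headings stay inside their parent's chunk.
func splitByHeadingLevel(markdown string, level int) []string {
	prefix := strings.Repeat("#", level) + " "
	var chunks []string
	var current []string // lines of the chunk being built; nil before the first heading
	flush := func() {
		if len(current) > 0 {
			chunks = append(chunks, strings.TrimSpace(strings.Join(current, "\n")))
		}
	}
	for _, line := range strings.Split(markdown, "\n") {
		if strings.HasPrefix(line, prefix) {
			flush()
			current = []string{line}
			continue
		}
		if current != nil {
			current = append(current, line)
		}
	}
	flush()
	return chunks
}

func main() {
	doc := "# Notes\nintro\n## archive/tar\ntar text\n## crypto/x509\n### sub\nx509 text\n"
	for i, c := range splitByHeadingLevel(doc, 2) {
		fmt.Printf("--- chunk %d ---\n%s\n", i, c)
	}
}
```

Each chunk carries only its own title here, which matches the observation above: the parent hierarchy is not included unless it is added separately.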

p.s. I notice all the sections fit in one chunk each, so maybe my thoughts about overlap aren't proven here.

k33g commented 2 weeks ago

@codefromthecrypt, thanks for the remarks and questions 🙏

About [Brief], it's a "meta prompt"; it works for a lot of models (not for the "small qwen2"; it is not "disciplined" enough for that 😉)

By the way, other meta prompts exist; I did some helpers for that: https://github.com/parakeet-nest/parakeet/blob/main/prompt/meta.go.

I have the same understanding as you, but I don't really like the overlap technique. I think we should use it only with a document that has no structure or very little structure.

To my mind, every chunk should be built carefully: a single chunk has to be understandable on its own (it must carry meaning), but it must not be too big (especially with a small LLM). The tricky part is to "keep the link" when several chunks are related (perhaps by adding metadata such as keywords to every related chunk). I need to study this topic more (I plan to read this: https://www.manning.com/books/knowledge-graph-enhanced-rag).

Regarding content.ParseMarkdown, I'm not totally sure why it's better (I had a long chat with ChatGPT to try to understand how to parse a markdown document while keeping the semantics of the document, and made a lot of attempts).

Your idea to keep the hierarchy is pretty good 👍 (I will see if I can do something with this)

codefromthecrypt commented 2 weeks ago

thanks for all the insight, research and code. We can keep this open or you can close it whenever you like

k33g commented 2 weeks ago

Keep it open 😄 btw I used another embedding model: embeddingsModel := "mxbai-embed-large"

k33g commented 2 weeks ago

@codefromthecrypt I wrote a new function to chunk the markdown content (ParseMarkdownWithHierarchy). It produces:

// The function produces a []Chunk; each Chunk has these fields:
type Chunk struct {
    Header       string
    Content      string
    Level        int
    Prefix       string
    ParentLevel  int
    ParentHeader string
    ParentPrefix string
}

Then, I can keep a "link" between a section and its parent section (I think I can improve it more by adding all the child sections)

Then, with this, you can prepare the content for the embeddings like this (for example):

### Trace {#trace} 

 <!-- Parent Section: ## Tools {#tools} --> 

 <!-- go.dev/issue/65316 -->
The `trace` tool now better tolerates partially broken traces by attempting to
recover what trace data it can. This functionality is particularly helpful when
viewing a trace that was collected during a program crash, since the trace data
leading up to the crash will now [be recoverable](/issue/65319) under most
circumstances.

You can test it here: https://github.com/parakeet-nest/parakeet/tree/main/examples/40-rag-with-elastic-markdown
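A minimal sketch of how a chunk could be turned into the text shown above before computing its embedding. The Chunk type copies the field list from this comment; how parakeet actually fills the fields is an assumption here (Prefix as the "###" marker, Header as the title text), and embeddingText is a hypothetical helper, not a library function.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk copies the field list quoted in this thread. The meaning given to
// Prefix ("###") and Header ("Trace {#trace}") is an assumption.
type Chunk struct {
	Header       string
	Content      string
	Level        int
	Prefix       string
	ParentLevel  int
	ParentHeader string
	ParentPrefix string
}

// embeddingText is a hypothetical helper: it writes the chunk's own heading,
// then an HTML comment naming the parent section (when there is one), then
// the section content.
func embeddingText(c Chunk) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s %s\n\n", c.Prefix, c.Header)
	if c.ParentHeader != "" {
		fmt.Fprintf(&b, "<!-- Parent Section: %s %s -->\n\n", c.ParentPrefix, c.ParentHeader)
	}
	b.WriteString(c.Content)
	return b.String()
}

func main() {
	c := Chunk{
		Header:       "Trace {#trace}",
		Content:      "The `trace` tool now better tolerates partially broken traces.",
		Level:        3,
		Prefix:       "###",
		ParentLevel:  2,
		ParentHeader: "Tools {#tools}",
		ParentPrefix: "##",
	}
	fmt.Println(embeddingText(c))
}
```

Putting the parent in an HTML comment keeps the chunk valid markdown while still placing the parent's title inside the embedded text.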

k33g commented 2 weeks ago

Ok, I added a new one, ParseMarkdownWithLineage

// The function produces a []Chunk; each Chunk has these fields:
type Chunk struct {
    Header       string
    Content      string
    Level        int
    Prefix       string
    ParentLevel  int
    ParentHeader string
    ParentPrefix string
}

Then you can add more "links":

#### [`path/filepath`](/pkg/path/filepath/) 

 <!-- Parent Section: ### Minor changes to the library {#minor_library_changes} --> 

 <!-- Lineage: Standard library {#library} > Minor changes to the library {#minor_library_changes} > [`path/filepath`](/pkg/path/filepath/) --> 

 The new [`Localize`](/pkg/path/filepath#Localize) function safely converts a slash-separated
path into an operating system path.

On Windows, [`EvalSymlinks`](/pkg/path/filepath#EvalSymlinks) no longer evaluates mount points,
which was a source of many inconsistencies and bugs.
This behavior is controlled by the `winsymlink` setting.
For Go 1.23, it defaults to `winsymlink=1`.
Previous versions default to `winsymlink=0`.

On Windows, [`EvalSymlinks`](/pkg/path/filepath#EvalSymlinks) no longer tries to normalize
volumes to drive letters, which was not always even possible.
This behavior is controlled by the `winreadlinkvolume` setting.
For Go 1.23, it defaults to `winreadlinkvolume=1`.
Previous versions default to `winreadlinkvolume=0`.
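One way to compute such a lineage is to keep a stack of the currently open headings while scanning the document. The sketch below is an assumption about the approach, not the code behind ParseMarkdownWithLineage; the section type and both function names are hypothetical, and it includes every ancestor level it finds, which may differ from what the library keeps.

```go
package main

import (
	"fmt"
	"strings"
)

// section is a hypothetical type holding one heading and its lineage.
type section struct {
	level   int
	title   string
	lineage string // ancestor titles and own title joined with " > "
}

// parseHeading reports the level and title of an ATX heading line
// ("## Title"), or ok=false when the line is not a heading.
func parseHeading(line string) (level int, title string, ok bool) {
	trimmed := strings.TrimLeft(line, "#")
	level = len(line) - len(trimmed)
	if level == 0 || level > 6 || !strings.HasPrefix(trimmed, " ") {
		return 0, "", false
	}
	return level, strings.TrimSpace(trimmed), true
}

// lineages scans the headings in order and builds each one's lineage from
// the stack of headings that are still open above it.
func lineages(markdown string) []section {
	var stack []section // open ancestors, shallowest first
	var out []section
	for _, line := range strings.Split(markdown, "\n") {
		level, title, ok := parseHeading(line)
		if !ok {
			continue
		}
		// A heading at the same or a deeper level cannot be an ancestor.
		for len(stack) > 0 && stack[len(stack)-1].level >= level {
			stack = stack[:len(stack)-1]
		}
		titles := make([]string, 0, len(stack)+1)
		for _, s := range stack {
			titles = append(titles, s.title)
		}
		titles = append(titles, title)
		sec := section{level: level, title: title, lineage: strings.Join(titles, " > ")}
		stack = append(stack, sec)
		out = append(out, sec)
	}
	return out
}

func main() {
	doc := "## Standard library\n### Minor changes to the library\n#### path/filepath\ntext\n## Ports\n"
	for _, s := range lineages(doc) {
		fmt.Printf("%d %s\n", s.level, s.lineage)
	}
}
```

The lineage string can then be written into the chunk as a comment, in the same way as the parent section line.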

codefromthecrypt commented 2 weeks ago

thanks, having a look quickly before flight! p.s. on 40 I have an error unrelated to the RAG approach:

$ echo $USER
adriancole
$ docker compose up -d
[+] Running 4/6
 ✔ Network better-rag_default      Created                                                                    0.1s 
 ✔ Volume "better-rag_kibanadata"  Created                                                                    0.0s 
 ✔ Volume "better-rag_esdata"      Created                                                                    0.0s 
 ⠹ Container better-rag-es01-1     Starting                                                                   0.2s 
 ⠹ Container better-rag-setup-1    Starting                                                                   0.2s 
 ✔ Container better-rag-kibana-1   Created                                                                    0.0s 
Error response from daemon: error while creating mount source path '/Users/adriancole/oss/parakeet/examples/40-rag-with-elastic-markdown/certs': chown /Users/adriancole/oss/parakeet/examples/40-rag-with-elastic-markdown/certs: permission denied

codefromthecrypt commented 2 weeks ago

fwiw I ran again and somehow it didn't mind 🤷

parakeet-nest / parakeet

Provide an example of externally sourced content #16