pashpashpash / vault-ai

OP Vault ChatGPT: Give ChatGPT long-term memory using the OP Stack (OpenAI + Pinecone Vector Database). Upload your own custom knowledge base files (PDF, txt, epub, etc) using a simple React frontend.
https://vault.pash.city
MIT License
3.26k stars, 307 forks

Multiple file uploads overwriting previous embeddings #24

Open Aemon-Algiz opened 1 year ago

Aemon-Algiz commented 1 year ago
    vectors := make([]PineconeVector, len(embeddings))
    for i, embedding := range embeddings {
        chunk := chunks[i]
        vectors[i] = PineconeVector{
            ID:     fmt.Sprintf("id-%d", i),
            Values: embedding,
            Metadata: map[string]string{
                "file_name": chunk.Title,
                "start":     strconv.Itoa(chunk.Start),
                "end":       strconv.Itoa(chunk.End),
                "title":     chunk.Title,
                "text":      chunk.Text,
            },
        }
    }

This works well for a single batched upload, but because each ID is just the chunk's index within the batch (id-0, id-1, ...), every new upload overwrites the embeddings from the previous one. UUIDs would allow multiple uploads, unless this was intentional to keep your Pinecone instance from becoming massive. If that's the case, perhaps you could add a public flag, so that private instances use some other ID scheme?
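To make the collision concrete, here is a minimal sketch that uses a plain Go map as a stand-in for the vector index (an assumption: like an upsert, writing to an existing key replaces the stored value). The names fakeIndex, upsertByPosition, upsertWithPrefix, and the "upload-1"/"upload-2" prefixes are hypothetical and are not part of the vault-ai code or the Pinecone API.

```go
package main

import "fmt"

// fakeIndex stands in for a vector index: writing to an existing ID
// replaces whatever was stored under it.
type fakeIndex map[string]string

// upsertByPosition mimics the ID scheme from the snippet above: the ID
// depends only on the chunk's position within the current batch.
func upsertByPosition(idx fakeIndex, chunks []string) {
	for i, text := range chunks {
		idx[fmt.Sprintf("id-%d", i)] = text
	}
}

// upsertWithPrefix adds a per-upload prefix (hypothetical) so that IDs
// from different uploads can never be equal.
func upsertWithPrefix(idx fakeIndex, uploadID string, chunks []string) {
	for i, text := range chunks {
		idx[fmt.Sprintf("%s-%d", uploadID, i)] = text
	}
}

func main() {
	first := []string{"a1", "a2"}
	second := []string{"b1", "b2"}

	positional := fakeIndex{}
	upsertByPosition(positional, first)
	upsertByPosition(positional, second)
	fmt.Println(len(positional), positional["id-0"]) // prints "2 b1": the first upload is gone

	prefixed := fakeIndex{}
	upsertWithPrefix(prefixed, "upload-1", first)
	upsertWithPrefix(prefixed, "upload-2", second)
	fmt.Println(len(prefixed)) // prints "4": both uploads survive
}
```

Any ID that is unique per chunk across uploads avoids the overwrite; a random UUID is simply the easiest way to get that without tracking upload state.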

AitoD commented 1 year ago

I fixed this by changing the ID to a UUID, so it doesn't overwrite the previously stored files when uploading new ones.

Will add a PR later today.

Yobo123o commented 1 year ago

Any updates on this issue? @AitoD, can you please elaborate on what you changed to use UUIDs rather than the index-based IDs? I am running into the overwriting issue when trying to compile a knowledge base.

Ctrl-Alt-Rage commented 1 year ago

@AitoD

I tried changing a few things in pinecone.go and the other postapi files, but npm start keeps failing with these errors:

postapi\pinecone.go:29:14: uuid.New undefined (type string has no field or method New) ... error

and

postapi\pinecone.go:97:2: syntax error: non-declaration statement outside function body ... error

obaqueiro commented 1 year ago

Got it. I think this works:

diff --git a/vault-web-server/postapi/pinecone.go b/vault-web-server/postapi/pinecone.go
index 2d8f1bd..0f9fae3 100644
--- a/vault-web-server/postapi/pinecone.go
+++ b/vault-web-server/postapi/pinecone.go
@@ -11,6 +11,7 @@ import (
        "math"
        "net/http"
        "strconv"
+       googleid "github.com/google/uuid"
 )

 type PineconeVector struct {
@@ -27,8 +28,9 @@ func upsertEmbeddingsToPinecone(embeddings [][]float32, chunks []Chunk, uuid str
        vectors := make([]PineconeVector, len(embeddings))
        for i, embedding := range embeddings {
                chunk := chunks[i]
+               myuuid := googleid.NewString()
                vectors[i] = PineconeVector{
-                       ID:     fmt.Sprintf("id-%d", i),
+                       ID:     fmt.Sprintf("id-%s", myuuid),
                        Values: embedding,
                        Metadata: map[string]string{
                                "file_name": chunk.Title,

Basically, import the Google uuid library under a different name (googleid) so that it isn't shadowed by the function's uuid parameter, then use it to generate a UUID for each vector.
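The "uuid.New undefined (type string has no field or method New)" error reported earlier in the thread is what this kind of shadowing looks like: inside the function, uuid refers to the string parameter, not the package. The sketch below reproduces the same pattern with only the standard library, assuming a hypothetical function shout whose parameter happens to be named strings; the import alias str plays the role that googleid plays in the diff.

```go
package main

import (
	"fmt"
	str "strings" // alias, so a parameter named "strings" cannot hide the package
)

// shout takes a parameter named strings. Without the alias above, the
// identifier strings inside this body would mean the parameter, and
// strings.ToUpper would fail to compile.
func shout(strings string) string {
	return str.ToUpper(strings)
}

func main() {
	fmt.Println(shout("id-123")) // prints "ID-123"
}
```

Renaming the parameter would work just as well; aliasing the import has the advantage of leaving the existing function signature untouched.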


I don't have the code handy, but it seems to me that the line that needs to change is:

ID: fmt.Sprintf("id-%d", i)

According to ChatGPT, one way to generate a UUID in Go is:

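package main

import (
    "fmt"

    uuid "github.com/satori/go.uuid"
)

func main() {
    // Depending on the library version, NewV4 returns either a UUID alone
    // or (UUID, error); this form assumes the two-value signature.
    myuuid, err := uuid.NewV4()
    if err != nil {
        panic(err)
    }
    fmt.Println(myuuid) // Go rejects unused variables, so use the result
}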

So that may do the trick. I'll try it later when I have access to the code.
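If pulling in a third-party package is undesirable, a random (version 4) UUID can also be built from the standard library alone. This is a swapped-in alternative to the libraries mentioned above, not what the project or either library does; the helper name newUUIDv4 is hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 returns a random (version 4) UUID string in the usual
// 8-4-4-4-12 hexadecimal layout, using crypto/rand for the random bytes.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version field to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	// Same "id-" prefix as the original ID scheme, with a unique suffix.
	fmt.Println("id-" + id)
}
```

A maintained library is usually the better choice in real code; the point here is only that the ID needs to be unique per chunk, however it is generated.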

Ctrl-Alt-Rage commented 1 year ago

Would I be putting this in the pinecone.go file? I received a syntax error, so I imagine I'm doing something wrong, haha.