pashpashpash / vault-ai

OP Vault ChatGPT: Give ChatGPT long-term memory using the OP Stack (OpenAI + Pinecone Vector Database). Upload your own custom knowledge base files (PDF, txt, epub, etc) using a simple React frontend.
https://vault.pash.city
MIT License
3.26k stars 307 forks

Seems to forget about earlier documents #9

Open cs96and opened 1 year ago

cs96and commented 1 year ago

If I upload a few documents, then it seems to forget about ones that I uploaded earlier. Is there a limit to the number of documents or tokens it will store per user?

eabjab commented 1 year ago

Having the same issue. It seems to only be able to reference the most recently uploaded document in my testing, even when running locally using my own pinecone index.

AitoD commented 1 year ago

Having the same issue, only the latest uploaded documents are referenced. It seems to overwrite the index whenever you upload a new file.

Ninn0x4F commented 1 year ago

Same issue here as well. I suspect it might be related to Pinecone itself: I uploaded multiple files and the vector count stopped increasing around 4k; after that, any new uploads mean only the last two documents are referenced for me.

I used 1536 for the dimensions and left the rest at the defaults. I uploaded my first document (an XML file with multiple item IDs) and asked for the first and last ItemID in the document; it got it nearly right. Then I continued uploading more files with the same structure, periodically asking for the first and last ItemID, and it started behaving as if it only saw the last two documents uploaded.

lonelycode commented 1 year ago

The issue is in how the upload code assigns IDs here:

    for i, embedding := range embeddings {
        chunk := chunks[i]
        vectors[i] = PineconeVector{
            ID:     fmt.Sprintf("id-%d", i),
            Values: embedding,
            Metadata: map[string]string{
                "file_name": chunk.Title,
                "start":     strconv.Itoa(chunk.Start),
                "end":       strconv.Itoa(chunk.End),
                "title":     chunk.Title,
                "text":      chunk.Text,
            },
        }
    }

The vector ID is just the chunk's index within a single file, so in a multi-file upload the IDs overlap (`id-0` repeats for file 1, file 2, etc.). The upsert operation updates or inserts based on the ID, essentially overwriting the vectors of previously processed files.
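The collision is easy to reproduce in isolation. A minimal sketch, where `naiveVectorID` is a hypothetical stand-in for the ID expression in the loop above:

```go
package main

import "fmt"

// naiveVectorID mirrors the buggy scheme: the ID is derived from the
// chunk index alone, so the filename never influences it.
func naiveVectorID(filename string, chunkIndex int) string {
	_ = filename // the bug: ignored entirely
	return fmt.Sprintf("id-%d", chunkIndex)
}

func main() {
	// Chunk 0 of two different files gets the same vector ID, so the
	// second file's upsert silently overwrites the first file's vector.
	fmt.Println(naiveVectorID("file-a.xml", 0)) // id-0
	fmt.Println(naiveVectorID("file-b.xml", 0)) // id-0
}
```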

Switching this to an ID that is a hash of the filename plus the chunk number would ensure uniqueness.

For example:

    import (
        "crypto/sha256"
        "encoding/hex"
    )

    // HashFileName returns the hex-encoded SHA-256 digest of the filename.
    func HashFileName(filename string) string {
        hash := sha256.Sum256([]byte(filename))
        return hex.EncodeToString(hash[:])
    }

    //...

    func (p *Pinecone) UploadEmbeddings(embeddings [][]float32, chunks []Chunk) error {
        // Prepare URL
        url := p.APIEndpoint + "/vectors/upsert"

        // Prepare the vectors, with IDs that are unique per file
        vectors := make([]PineconeVector, len(embeddings))
        for i, embedding := range embeddings {
            vectorID := fmt.Sprintf("id-%s-%d", HashFileName(chunks[i].Title), i)
            vectors[i] = PineconeVector{
                ID:     vectorID,
                Values: embedding,
    // ...
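To see the proposed scheme end to end, here is a self-contained sketch; `uniqueVectorID` is a hypothetical helper, not a function from the repo:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// HashFileName hex-encodes a SHA-256 digest of the filename, as in the
// fix suggested above.
func HashFileName(filename string) string {
	hash := sha256.Sum256([]byte(filename))
	return hex.EncodeToString(hash[:])
}

// uniqueVectorID (hypothetical helper) combines the filename hash with
// the chunk index, so chunks from different files never share an ID.
func uniqueVectorID(filename string, chunkIndex int) string {
	return fmt.Sprintf("id-%s-%d", HashFileName(filename), chunkIndex)
}

func main() {
	a := uniqueVectorID("file-a.xml", 0)
	b := uniqueVectorID("file-b.xml", 0)
	fmt.Println(a == b) // false: uploads no longer overwrite each other
	// Re-uploading the same file still maps onto the same IDs, so the
	// upsert replaces stale chunks instead of accumulating duplicates.
	fmt.Println(a == uniqueVectorID("file-a.xml", 0)) // true
}
```

A side effect worth noting: because the ID is deterministic per (filename, chunk index), re-uploading a file updates its existing vectors in place rather than creating duplicates.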