vocodedev / vocode-core

🤖 Build voice-based LLM agents. Modular + open source.
https://vocode.dev
MIT License

VectorDB #45

Closed AndromedaPerseus closed 4 months ago

AndromedaPerseus commented 1 year ago

It would be great if there was support for querying over vector databases such as Pinecone.

rjheeta commented 1 year ago

I'm interested in this too. However, see here: https://github.com/vocodedev/vocode-python/pull/264

@HHousen Is this functional now?

HHousen commented 1 year ago

@rjheeta @EyeOfHorus396 Yes! You can add documents to a Pinecone index by following the directions from LangChain: https://python.langchain.com/docs/integrations/vectorstores/pinecone. It should look something like the following:

import os
import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import SpacyTextSplitter
from langchain.vectorstores import Pinecone
from langchain.document_loaders import DirectoryLoader, UnstructuredFileLoader

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_ENVIRONMENT = os.environ["PINECONE_ENVIRONMENT"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

loader = DirectoryLoader('./docs', glob="**/*.*", show_progress=True, loader_cls=UnstructuredFileLoader)
print("Loading documents...")
documents = loader.load()
text_splitter = SpacyTextSplitter(chunk_size=1000)
print("Splitting documents...")
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT,
)

index_name = "your_index_name"

print("Creating index...")
docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)

(The above script loads documents of many formats from the docs/ directory and adds them to an index.)
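(For readers who want to see what the splitting step does without installing spaCy, here is a stdlib-only sketch. The real `SpacyTextSplitter` splits on sentence boundaries; this simplified stand-in splits on paragraph breaks and greedily packs them into chunks of at most `chunk_size` characters. The function name and packing strategy here are illustrative, not LangChain's actual implementation.)

```python
def split_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedily pack paragraph-separated pieces into chunks of at most
    chunk_size characters. A single oversized paragraph becomes its own chunk."""
    pieces = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}\n\n{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep accumulating
        else:
            if current:
                chunks.append(current)  # flush the full chunk
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Example: three paragraphs, a small chunk size forces a split.
doc = "First paragraph.\n\nSecond paragraph.\n\nThird."
print(split_text(doc, chunk_size=40))
```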

Then, create an agent like this:

from vocode.streaming.models.vector_db import PineconeConfig

ChatGPTAgent(
    ChatGPTAgentConfig(
        ...
        vector_db_config=PineconeConfig(index="your_index_name")
    ),
)

With this config, before every user message, the top 3 related documents will be injected into the conversation history sent to the ChatGPT OpenAI API.
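Conceptually, that injection step behaves like the sketch below: score the stored chunks against the latest user message, take the top 3, and prepend them as a system message before the history is sent to the chat API. The `overlap_score` word-overlap scorer and the `inject_context` helper are hypothetical stand-ins for illustration only; the real implementation retrieves via embedding similarity against the Pinecone index.

```python
def overlap_score(query: str, doc: str) -> int:
    # Toy relevance score: count of shared lowercase words.
    # (The real retrieval uses embedding similarity via Pinecone.)
    return len(set(query.lower().split()) & set(doc.lower().split()))

def inject_context(messages: list[dict], docs: list[str], k: int = 3) -> list[dict]:
    """Prepend the k docs most relevant to the latest user message
    as a system message, leaving the original history untouched."""
    query = next(m["content"] for m in reversed(messages) if m["role"] == "user")
    top = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)[:k]
    context = {"role": "system",
               "content": "Relevant documents:\n" + "\n---\n".join(top)}
    return [context, *messages]

docs = [
    "Our refund policy allows returns within 30 days.",
    "The office is open Monday through Friday.",
    "Refunds are issued to the original payment method.",
    "Shipping takes 3-5 business days.",
]
history = [{"role": "user", "content": "How do refunds work?"}]
augmented = inject_context(history, docs, k=3)
print(augmented[0]["content"])  # system message with the 3 best-matching docs
```

Note that the retrieved documents are added alongside the conversation history rather than replacing it, which is why the model's general capabilities are preserved.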

rjheeta commented 1 year ago

@HHousen Thank you for the great explanation and the sample code! I just want to clarify your intent with the following design:

before every user message, the top 3 related documents will be injected into the conversation history sent to the ChatGPT OpenAI API.

My understanding is that this will leverage both 1) ChatGPT's existing training data and 2) the custom context we build into the Pinecone DB? I've seen Pinecone + LLM implementations that answer purely from Pinecone, and with those we lose the versatility that GPT offered in the first place. I want to use Pinecone to "push" my LLM in the right direction with my custom KB, but not to entirely erode the versatility of its original training.

It looks like your approach addresses this already, but I just wanted to clarify. Thank you!

HHousen commented 1 year ago

Yes, you're correct. In this implementation, Pinecone just provides additional information that ChatGPT can use in its response.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 4 months ago

This issue has been automatically closed due to inactivity. Thank you for your contributions.