stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

ChromaDB minimal example #469

Closed samiit closed 3 months ago

samiit commented 7 months ago

Hi everyone

I am trying to create a minimal running example of integrating ChromaDB with DSPy.

import chromadb
from dspy.retrieve.chromadb_rm import ChromadbRM

chroma_client = chromadb.Client()

collection = chroma_client.create_collection(name="furniture")

collection.add(
    documents=[
        "couch, bed, table, chair", 
        "computer, server, table, chair"],
    metadatas=[
        {"source": "Bedroom"}, 
        {"source": "Office"}
        ],
    ids=[
        "id1", 
        "id2"]
)

# Error happens here
rm = ChromadbRM(collection_name=collection, persist_directory="local_store")

The last line goes wrong with the following message:

File D:\Sam\Projects\LLM_Apps\DSPy\VelocityDemo\venv\Lib\site-packages\chromadb\api\segment.py:76, in check_index_name(index_name)
     66 def check_index_name(index_name: str) -> None:
     67     msg = (
     68         "Expected collection name that "
     69         "(1) contains 3-63 characters, "
   (...)
     74         f"got {index_name}"
     75     )
---> 76     if len(index_name) < 3 or len(index_name) > 63:
     77         raise ValueError(msg)
     78     if not re.match("^[a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9]$", index_name):

TypeError: object of type 'Collection' has no len()

Any suggestions, or hints at correctly using ChromaDB with DSPy?

caiobd commented 7 months ago

You seem to be passing the wrong value to the retriever: it expects the collection name, not the collection object itself. Also, if you want to persist the documents locally, you should probably use chromadb's PersistentClient. Here is a minimal working example you can build on:

import chromadb
from chromadb.utils import embedding_functions
from dspy.retrieve.chromadb_rm import ChromadbRM

chroma_client = chromadb.PersistentClient(path="./furniture_example")
default_ef = embedding_functions.DefaultEmbeddingFunction()
collection = chroma_client.get_or_create_collection(name="furniture", embedding_function=default_ef)

collection.add(
    documents=[
        "couch, bed, table, chair", 
        "computer, server, table, chair"],
    metadatas=[
        {"source": "Bedroom"}, 
        {"source": "Office"}
        ],
    ids=[
        "id1", 
        "id2"
    ]
)

rm = ChromadbRM(collection_name='furniture', persist_directory="./furniture_example", embedding_function=default_ef)
print(rm('comfy'))
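
To see why the original call fails, here is a small sketch that needs neither chromadb nor DSPy. It reuses a simplified version of the name check shown in the traceback and passes it a stand-in object. `FakeCollection` is a hypothetical class made up for illustration, and the assumption that the real collection object exposes its name as `.name` should be checked against your installed chromadb version.

```python
import re


def check_index_name(index_name):
    """Simplified version of the collection-name check shown in the traceback."""
    if len(index_name) < 3 or len(index_name) > 63:
        raise ValueError(f"expected 3-63 characters, got {index_name}")
    if not re.match("^[a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9]$", index_name):
        raise ValueError(f"unexpected characters in {index_name}")


class FakeCollection:
    """Hypothetical stand-in for a collection object: has a name, but no __len__."""

    def __init__(self, name):
        self.name = name


collection = FakeCollection("furniture")

try:
    # Mirrors ChromadbRM(collection_name=collection, ...): an object, not a string.
    check_index_name(collection)
    outcome = "accepted"
except TypeError as exc:
    outcome = f"TypeError: {exc}"

print(outcome)

# Passing the name string instead goes through the check without raising.
check_index_name(collection.name)
```

The `len()` call fails before the regular expression is ever reached, which is why the message talks about `len()` rather than about an invalid name.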

csaiedu commented 5 months ago

I was trying to follow up on this minimal example without going through OpenAI, using a different embedding function, but OpenAI still seems to be chosen by default, since it asks for authentication. Is there a different way to declare the RM with ChromadbRM?

rm = ChromadbRM('furniture', "./furniture_example", embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2"), k=3)

mlederbauer commented 5 months ago

@csaiedu Does this perhaps work for you? As per the ChromaDB documentation, “by default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings”, which seems to be the embedding function you were using in your example.

from chromadb.utils import embedding_functions

embedding_function = embedding_functions.DefaultEmbeddingFunction()
retrieval_model = ChromadbRM(
    collection_name=database_name,
    persist_directory=CHROMA_DB_PATH,
    embedding_function=embedding_function,
)

When I ran it, I didn’t need to authenticate with OpenAI. I am also not running into authentication issues with embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2") for one of my Chroma databases. Could you rule out the embedding function as the cause?

csaiedu commented 5 months ago

Hi Magdalena,

Thank you for your prompt response

When running your code sample on Windows 10 with Python 3.9 or 3.12, I get TypeError: __init__() got an unexpected keyword argument 'embedding_function'

DSPy version: dspy-ai 2.4.0

ChromaDB version: chromadb 0.4.24

Regards
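
An error like the one above means the constructor being called in that environment does not accept the keyword. The thread does not establish why. One way to check what an installed class accepts, on any OS, is to inspect its constructor signature. The sketch below is only an illustration: `OlderRetriever` and `NewerRetriever` are hypothetical stand-ins, not the real ChromadbRM signatures.

```python
import inspect


def accepts_keyword(cls, keyword):
    """Return True if cls's constructor takes `keyword` or arbitrary **kwargs."""
    params = inspect.signature(cls.__init__).parameters
    if keyword in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


class OlderRetriever:
    """Hypothetical stand-in whose constructor has no embedding_function."""

    def __init__(self, collection_name, persist_directory, k=7):
        self.collection_name = collection_name


class NewerRetriever:
    """Hypothetical stand-in whose constructor accepts embedding_function."""

    def __init__(self, collection_name, persist_directory, embedding_function=None, k=7):
        self.collection_name = collection_name


older_ok = accepts_keyword(OlderRetriever, "embedding_function")
newer_ok = accepts_keyword(NewerRetriever, "embedding_function")
print(older_ok, newer_ok)  # False True
```

With DSPy installed, calling `accepts_keyword(ChromadbRM, "embedding_function")` in each environment would run the same check against the real class and show whether the two installations actually differ.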



csaiedu commented 5 months ago

Never mind,

I checked your code on Linux and it works fine there, so I imagine the problem is specific to running that library on Windows.

thanks for your help

Kind regards



mlederbauer commented 5 months ago

No problem. Yes, I ran the code on macOS. Great that it works now – let us know in case something else comes up!

csaiedu commented 5 months ago

thanks
