run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: ChromaDB Setup Error #14705

Open JoseGHdz opened 1 month ago

JoseGHdz commented 1 month ago

Question Validation

Question

Hello, I am using more data in my system, so I am attempting to set up a ChromaDB server to retrieve my vectorized information instead of relying on disk storage retrieval. Below is the code that I am using:

def citation_indexing():
    chroma_client = chromadb.Client()

    try:
        print("Loading Vector Content")
        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='azure')
        azure_index = load_index_from_storage(storage_context, show_progress=True)

        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='assessment')
        assessment_index = load_index_from_storage(storage_context, show_progress=True)

        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='control')
        control_index = load_index_from_storage(storage_context, show_progress=True)

        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='questionaire')
        questionaire_index = load_index_from_storage(storage_context, show_progress=True)

        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='aws_docs')
        aws_index = load_index_from_storage(storage_context, show_progress=True)

        index_loaded = True
    except Exception as e:
        print(f"Error Loading Vector Content: {e}")
        index_loaded = False

    if not index_loaded:
        print('Vectorizing Content')
        # load data
        azure_docs = SimpleDirectoryReader(
            input_files=["/home/ubuntu/environment/revised-Project/All Docs/azure_services.pdf"]
        ).load_data()
        assessment_docs = SimpleDirectoryReader(
            input_files=["/home/ubuntu/environment/revised-Project/assessment-procedures.pdf"]).load_data()
        control_docs = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/controls").load_data()
        ques_docs = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/ques").load_data()
        aws_documents = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/AWSDOCS").load_data()
        # build index

        azure_index = VectorStoreIndex.from_documents(azure_docs, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='azure'), show_progress=True)
        assessment_index = VectorStoreIndex.from_documents(assessment_docs, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='assessment'), show_progress=True)
        control_index = VectorStoreIndex.from_documents(control_docs, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='control'), show_progress=True)
        ques_index = VectorStoreIndex.from_documents(ques_docs, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='questionaire'), show_progress=True)
        aws_index = VectorStoreIndex.from_documents(aws_documents, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='aws_docs'), show_progress=True)

    print("Vector Content Loaded")
    return azure_index, assessment_index, control_index, ques_index, aws_index

I am getting an error that says: ValueError: Could not connect to tenant default_tenant. Are you sure it exists?

I also attempted to start a ChromaDB server using chroma_server --host {EC2 IP} --port {EC2 Port #}. How can I fix this error so that I can create a ChromaDB server usable from my Cloud9 environment?

logan-markewich commented 1 month ago

storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='azure') This is not proper syntax.

Please see the docs https://docs.llamaindex.ai/en/stable/examples/vector_stores/ChromaIndexDemo/?h=chroma

For example

import chromadb
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

remote_db = chromadb.HttpClient(...)
chroma_collection = remote_db.get_or_create_collection("quickstart")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

Check out Chroma's docs as well for setting up the client: https://docs.trychroma.com/guides#using-the-python-http-only-client
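Since "Could not connect to tenant default_tenant" can simply mean the server is unreachable, it may help to verify connectivity before constructing the client. This is a sketch only: `chroma_is_up` is a hypothetical helper, and the `/api/v1/heartbeat` path is an assumption about the Chroma HTTP API that may differ between server versions.

```python
import urllib.request


def chroma_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something answers Chroma's heartbeat endpoint.

    The /api/v1/heartbeat path is an assumption and may vary by
    Chroma server version.
    """
    url = f"http://{host}:{port}/api/v1/heartbeat"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, ...
        return False
```

If this returns False for your EC2 host/port, the problem is networking (server not started, security group, wrong port) rather than the LlamaIndex code.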

JoseGHdz commented 1 month ago


I followed the documentation; the specific part I followed is this:

# save to disk

db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# load from disk
db2 = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db2.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embed_model,
)

Now my code looks like this:

def citation_indexing():
    db_path = "./chroma_db"
    embed_model = OpenAIEmbedding(model="text-embedding-3-large")

    try:
        print("Loading Vector Content")

        db = chromadb.PersistentClient(path=db_path)

        azure_collection = db.get_or_create_collection("azure")
        azure_vector_store = ChromaVectorStore(chroma_collection=azure_collection)
        azure_index = VectorStoreIndex.from_vector_store(
            vector_store=azure_vector_store,
            embed_model=embed_model,
        )

        assessment_collection = db.get_or_create_collection("assessment")
        assessment_vector_store = ChromaVectorStore(chroma_collection=assessment_collection)
        assessment_index = VectorStoreIndex.from_vector_store(
            vector_store=assessment_vector_store,
            embed_model=embed_model,
        )

        control_collection = db.get_or_create_collection("control")
        control_vector_store = ChromaVectorStore(chroma_collection=control_collection)
        control_index = VectorStoreIndex.from_vector_store(
            vector_store=control_vector_store,
            embed_model=embed_model,
        )

        questionnaire_collection = db.get_or_create_collection("questionnaire")
        questionnaire_vector_store = ChromaVectorStore(chroma_collection=questionnaire_collection)
        questionnaire_index = VectorStoreIndex.from_vector_store(
            vector_store=questionnaire_vector_store,
            embed_model=embed_model,
        )

        aws_docs_collection = db.get_or_create_collection("aws_docs")
        aws_docs_vector_store = ChromaVectorStore(chroma_collection=aws_docs_collection)
        aws_index = VectorStoreIndex.from_vector_store(
            vector_store=aws_docs_vector_store,
            embed_model=embed_model,
        )

        index_loaded = True
    except Exception as e:
        print(f"Error Loading Vector Content: {e}")
        index_loaded = False

    if not index_loaded:
        print('Vectorizing Content')
        # load data
        azure = SimpleDirectoryReader(
            input_files=["/home/ubuntu/environment/revised-Project/All Docs/azure.pdf"]
        ).load_data()
        assessment = SimpleDirectoryReader(
            input_files=["/home/ubuntu/environment/revised-Project/assessment-procedures.pdf"]).load_data()
        control = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/Controls").load_data()
        questionnaire = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/Questionnaire").load_data()
        aws_documents = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/AWSDOCS").load_data()

        # build index
        azure_collection = db.get_or_create_collection("azure")
        azure_vector_store = ChromaVectorStore(chroma_collection=azure_collection)
        azure_storage_context = StorageContext.from_defaults(vector_store=azure_vector_store)
        azure_index = VectorStoreIndex.from_documents(azure, storage_context=azure_storage_context, embed_model=embed_model, show_progress=True)

        assessment_collection = db.get_or_create_collection("assessment")
        assessment_vector_store = ChromaVectorStore(chroma_collection=assessment_collection)
        assessment_storage_context = StorageContext.from_defaults(vector_store=assessment_vector_store)
        assessment_index = VectorStoreIndex.from_documents(assessment, storage_context=assessment_storage_context, embed_model=embed_model, show_progress=True)

        control_collection = db.get_or_create_collection("control")
        control_vector_store = ChromaVectorStore(chroma_collection=control_collection)
        control_storage_context = StorageContext.from_defaults(vector_store=control_vector_store)
        control_index = VectorStoreIndex.from_documents(control, storage_context=control_storage_context, embed_model=embed_model, show_progress=True)

        questionnaire_collection = db.get_or_create_collection("questionnaire")
        questionnaire_vector_store = ChromaVectorStore(chroma_collection=questionnaire_collection)
        questionnaire_storage_context = StorageContext.from_defaults(vector_store=questionnaire_vector_store)
        questionnaire_index = VectorStoreIndex.from_documents(questionnaire, storage_context=questionnaire_storage_context, embed_model=embed_model, show_progress=True)

        aws_docs_collection = db.get_or_create_collection("aws_docs")
        aws_docs_vector_store = ChromaVectorStore(chroma_collection=aws_docs_collection)
        aws_docs_storage_context = StorageContext.from_defaults(vector_store=aws_docs_vector_store)
        aws_index = VectorStoreIndex.from_documents(aws_documents, storage_context=aws_docs_storage_context, embed_model=embed_model, show_progress=True)

    print("Vector Content Loaded")
    return azure_index, assessment_index, control_index, questionnaire_index, aws_index

The issue that I am facing now is that the content is no longer being recognized, which is why the observation is empty. I am using chain-of-thought prompting to see the reasoning of the RAG LLM, but all it gives me is:

Batch
> Current query: Write a detailed description of the following service: Batch. Describe what it's used for and what it does.
> New query: Which AWS services apply to the  analytics system Controls?
> Running step f1266b2e-5d6d-4ba6-8290-d952378a6856. Step input: Which AWS services apply to the system analytics Controls?
Thought: The current language of the user is English. I need to use a tool to help me answer the question.
Action: aws_services
Action Input: {'input': 'System Analytics Controls'}
Observation: Empty Response
Thought: Since the tools did not return any information, I will provide a general answer based on my knowledge.
Answer: AWS offers a wide range of services that are satisfy the provided Controls. Some of these services include:

1. **Amazon EC2 (Elastic Compute Cloud)** - Provides scalable computing capacity.
2. **Amazon S3 (Simple Storage Service)** - Offers scalable object storage.
3. **Amazon RDS (Relational Database Service)** - Simplifies setting up, operating, and scaling a relational database.
4. **AWS Lambda** - Allows you to run code without provisioning or managing servers.
5. **Amazon VPC (Virtual Private Cloud)** - Enables you to launch AWS resources in a virtual network that you define.
6. **AWS IAM (Identity and Access Management)** - Helps you securely control access to AWS services and resources.
7. **AWS CloudTrail** - Enables governance, compliance, and operational and risk auditing of your AWS account.
8. **AWS Config** - Provides AWS resource inventory, configuration history, and configuration change notifications.
9. **AWS Shield** - Provides managed DDoS protection.
10. **AWS WAF (Web Application Firewall)** - Helps protect your web applications from common web exploits.

These services are part of AWS's compliance with , which ensures that they meet the stringent security requirements outlined in the Controls. For a complete and up-to-date list of AWS services that are authorized, you can refer to the AWS page or the AWS Services in Scope by Compliance Program documentation.
> Current query: Write a detailed description of the following service: Batch. Describe what it's used for and what it does.
logan-markewich commented 1 month ago

The try/except will never hit the except branch

For example, these two lines will always work

db = chromadb.PersistentClient(path=db_path)
azure_collection = db.get_or_create_collection("azure")

Regardless of whether db_path exists, and regardless of whether the collection exists yet.

Probably instead of a try/except, you should check if the db_path exists. (And delete the db_path before rerunning so it properly rebuilds)
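The suggestion above can be sketched with stdlib checks alone. `should_load` is a hypothetical helper name; the real loading/building calls (`VectorStoreIndex.from_vector_store` vs. `VectorStoreIndex.from_documents`) would go in the two branches of the caller.

```python
import os
import shutil


def should_load(db_path: str, force_rebuild: bool = False) -> bool:
    """Decide between loading a persisted Chroma store and rebuilding it.

    PersistentClient and get_or_create_collection both succeed even on a
    fresh, empty path, so a try/except around them never fires. Branch on
    whether the persisted directory exists instead.
    """
    if force_rebuild and os.path.isdir(db_path):
        shutil.rmtree(db_path)  # delete db_path so it properly rebuilds
    return os.path.isdir(db_path)
```

In `citation_indexing`, this would replace the try/except: `if should_load(db_path):` load each index with `from_vector_store`, `else:` read the documents and build with `from_documents`.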

JoseGHdz commented 1 month ago


I got the RAG system to work. Something that I included that is not in the documentation (https://docs.llamaindex.ai/en/stable/examples/vector_stores/ChromaIndexDemo/?h=chroma) is the ServiceContext:

service_context = ServiceContext.from_defaults(embed_model=embed_model, chunk_size=1000, chunk_overlap=20)

At least with the content that I provided, the data needed to be chunked before building the vector index. After I chunked my data, it worked.
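For readers unfamiliar with the two knobs set on the ServiceContext above, here is an illustrative, character-based sketch of what chunk_size and chunk_overlap control. `chunk_text` is a hypothetical helper: LlamaIndex's real splitters work on tokens and sentence boundaries, not raw characters (and in recent LlamaIndex versions ServiceContext has been deprecated in favor of passing a node parser/transformations).

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> list[str]:
    """Illustration only: split text into windows of chunk_size characters,
    where each window overlaps the previous one by chunk_overlap characters."""
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Smaller chunks give the retriever more precise units to match against, at the cost of more embeddings; the overlap keeps sentences that straddle a boundary from being split away from their context.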

As for the try/except, I included that more for debugging purposes but now that it works, I'll make the changes you suggested. Thanks for your help.