langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
95.19k stars 15.44k forks

Chroma similarity_search and similarity_search_with_score do not return any results #27273

Open guninder opened 1 month ago

guninder commented 1 month ago

Checked other resources

Example Code

Hi, I am new to LangChain and Chroma. I am trying to insert data into ChromaDB and then search it. There is no issue with the data itself: I tried the same search with a knowledge base created in Bedrock and it worked there. I don't get any error here either; the database is created (data_level0.bin is about 6.3 MB), but the search returns empty results. Following is the code to insert the data.

```
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
import os

CHROMA_PATH = "data/chroma_wp"
os.environ["OPENAI_API_KEY"] = "sk-"  # replace with your key

# Load the book and split it into overlapping chunks
loader = TextLoader("books/war_and_peace.txt", encoding="utf-8")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200, separator="\n")
chunks = text_splitter.split_documents(documents)

# Embed the chunks and write them to the persistent Chroma store
embeddings = OpenAIEmbeddings()
vectorStore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=CHROMA_PATH)
```

Following is the code I am using to search.
```
import os
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = "sk-"  # key redacted
CHROMA_PATH = "data/chroma_wp"

# Re-open the persisted store with the same embedding function used for ingestion
embeddings = OpenAIEmbeddings()
vectorStore = Chroma(persist_directory=CHROMA_PATH, embedding_function=embeddings)
# vectorStore.delete()
print(vectorStore)

# Retrieve the three closest chunks
results = vectorStore.similarity_search("Who is Andrew?", k=3)
# vectorStore.similarity_search_with_score("Who is Andrew?", k=3)

print(results)
```

I get empty results.
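For reference, a quick way to check whether anything was actually written to the store (a minimal check, assuming the same CHROMA_PATH, the default collection, and that OPENAI_API_KEY is already set in the environment):

```
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

CHROMA_PATH = "data/chroma_wp"

store = Chroma(persist_directory=CHROMA_PATH, embedding_function=OpenAIEmbeddings())

# How many embeddings the underlying collection actually contains
print(store._collection.count())

# Peek at a couple of stored documents and their ids (no embedding call is made here)
print(store.get(limit=2))
```

If the count is 0, the problem is on the ingestion side (for example a different working directory or persist path) rather than in the search itself.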

Following are the packages I am using:

langchain 0.3.1

langchain-chroma 0.1.4

langchain-community 0.3.1

langchain-core 0.3.6

langchain-experimental 0.3.2

langchain-openai 0.2.1

langchain-text-splitters 0.3.0

chroma-hnswlib 0.7.6

chromadb 0.5.12

Error Message and Stack Trace (if applicable)

No exception. Just empty results.

Description

Same as the example code above: the insert script runs without errors and the Chroma database directory is created, but both similarity_search and similarity_search_with_score return empty results. The package versions are listed above.

System Info

System Information

OS: Windows
OS Version: 10.0.22631
Python Version: 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)]

Package Information

langchain_core: 0.3.6
langchain: 0.3.1
langchain_community: 0.3.1
langsmith: 0.1.129
langchain_chroma: 0.1.4
langchain_experimental: 0.3.2
langchain_openai: 0.2.1
langchain_text_splitters: 0.3.0

Optional packages not installed

langgraph langserve

Other Dependencies

aiohttp: 3.10.6
async-timeout: 4.0.3
chromadb: 0.5.12
dataclasses-json: 0.6.7
fastapi: 0.115.0
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.50.1
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 6.0.2
requests: 2.32.3
SQLAlchemy: 2.0.35
tenacity: 8.5.0
tiktoken: 0.7.0
typing-extensions: 4.12.2

iharshlalakiya commented 1 month ago

Hi, I can help you with this problem.

guninder commented 1 month ago

@iharshlalakiya, thank you, I would appreciate that. Please let me know if you need any other information.

gauravprasadgp commented 3 weeks ago

@guninder, is this issue resolved? If not, try the code below:

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

The line below creates the embedding for your query:

query = embeddings.embed_query("Who is Andrew?")

It then performs the vector similarity search against the embedding generated for the query:

results = vectorStore.similarity_search_by_vector_with_relevance_scores(embedding=query, k=3)
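Put together, that suggestion would look roughly like this (a sketch, assuming the store at CHROMA_PATH was built with the same embedding model; if the index was created with a different model, the query vector's dimensions will not match the stored vectors):

```
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

CHROMA_PATH = "data/chroma_wp"

# Must be the same model that was used when the documents were embedded
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorStore = Chroma(persist_directory=CHROMA_PATH, embedding_function=embeddings)

# Embed the query text explicitly, then search by vector with relevance scores
query = embeddings.embed_query("Who is Andrew?")
results = vectorStore.similarity_search_by_vector_with_relevance_scores(embedding=query, k=3)

for doc, score in results:
    print(score, doc.page_content[:100])
```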

guninder commented 3 weeks ago

@gauravprasadgp, thanks, but that doesn't seem right. I had already tried defining the OpenAIEmbeddings instance in one shared place and importing it in both the writing and the reading script. I believe the default model is text-embedding-ada-002; if I use any other model, it throws an exception. Most likely this is a version problem.
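One way to rule out a mismatch entirely is to pin the model and collection name in a single shared helper that both the ingest script and the search script import; a minimal sketch (embeddings_config.py is a hypothetical file name, and the collection name shown is simply Chroma's default):

```
# embeddings_config.py -- hypothetical shared module imported by both scripts
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

CHROMA_PATH = "data/chroma_wp"

def get_embeddings():
    # Pin the model explicitly so indexing and querying always agree
    return OpenAIEmbeddings(model="text-embedding-ada-002")

def get_vector_store():
    # Both the writer and the reader must use the same collection name;
    # if from_documents was called without one, the default is "langchain".
    return Chroma(
        collection_name="langchain",
        persist_directory=CHROMA_PATH,
        embedding_function=get_embeddings(),
    )
```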

iharshlalakiya commented 1 week ago

@guninder, try the code below:

```
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
import os
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

CHROMA_PATH = "data/chroma_wp"
os.environ["OPENAI_API_KEY"] = "your-api-key-here"  # Replace with your API key

def create_vector_store():
    try:
        # 1. Load the document
        logger.info("Loading document...")
        loader = TextLoader("books/war_and_peace.txt", encoding="utf-8")
        documents = loader.load()
        logger.info(f"Loaded {len(documents)} documents")

        # 2. Split the documents
        logger.info("Splitting documents into chunks...")
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200, separator="\n")
        chunks = text_splitter.split_documents(documents)
        logger.info(f"Created {len(chunks)} chunks")

        # 3. Create embeddings and store in Chroma
        # (With langchain_chroma and chromadb >= 0.4, the store persists automatically
        # when persist_directory is set, so no explicit persist() call is needed.)
        logger.info("Creating embeddings and storing in Chroma...")
        embeddings = OpenAIEmbeddings()
        vector_store = Chroma.from_documents(
            documents=chunks,
            embedding=embeddings,
            persist_directory=CHROMA_PATH
        )
        logger.info(f"Database persisted to {CHROMA_PATH}")

        return vector_store

    except Exception as e:
        logger.error(f"Error in create_vector_store: {str(e)}")
        raise

def search_vector_store():
    try:
        # 1. Check if database exists
        if not os.path.exists(CHROMA_PATH):
            logger.error(f"Database directory {CHROMA_PATH} does not exist!")
            return

        logger.info("Initializing embeddings...")
        embeddings = OpenAIEmbeddings()

        # 2. Load the persisted database
        logger.info("Loading the persisted database...")
        vector_store = Chroma(
            persist_directory=CHROMA_PATH,
            embedding_function=embeddings
        )

        # 3. Get collection info
        collection = vector_store._collection
        logger.info(f"Collection count: {collection.count()}")

        # 4. Perform the search
        query = "Who is Andrew?"
        logger.info(f"Performing search with query: '{query}'")
        results = vector_store.similarity_search(query=query, k=3)

        # 5. Print results
        if results:
            logger.info(f"Found {len(results)} results")
            for i, doc in enumerate(results, 1):
                logger.info(f"Result {i}:")
                logger.info(f"Content: {doc.page_content[:200]}...")
                logger.info(f"Metadata: {doc.metadata}")
        else:
            logger.warning("No results found!")

        return results

    except Exception as e:
        logger.error(f"Error in search_vector_store: {str(e)}")
        raise

def main():
    # Create database directory if it doesn't exist
    os.makedirs(CHROMA_PATH, exist_ok=True)

    # First time: create and populate the database
    if not os.listdir(CHROMA_PATH):
        logger.info("Creating new vector store...")
        create_vector_store()

    # Search the database
    logger.info("Searching the vector store...")
    results = search_vector_store()

    return results

if __name__ == "__main__":
    main()
```