langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

Chroma vectorstore similaritySearch is not returning results while the Chroma client used directly does #5519

Open justinlevi opened 1 month ago

justinlevi commented 1 month ago


Example Code

Here is my code:

import { ChromaClient, DefaultEmbeddingFunction } from "chromadb";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

const url = "http://localhost:8010";
const query = "What is this library about?";

// THIS DOES NOT WORK
async function fetchDocumentsWithLangchain() {
  try {
    const vectorStore = new Chroma(new OpenAIEmbeddings(), {
      url,
    });

    // Perform a similarity search
    const response = await vectorStore.similaritySearch(query, 2);

    // Log and send the response
    console.log("Retrieved Documents:", response);
  } catch (error) {
    console.error("Error fetching Langchain vectors:", error);
  }
}

// THIS WORKS
const fetchDocumentsWithChromaDBClient = async () => {
  const client = new ChromaClient({
    path: url,
  });

  const collections = await client.listCollections();
  console.log("🚀 ~ collections", collections);

  const ef = new DefaultEmbeddingFunction();

  const collection = await client.getCollection({
    name: "pyansys",
    embeddingFunction: ef,
  });

  console.log(JSON.stringify(collection, null, 2));

  const results = await collection.query({
    queryTexts: [query],
    nResults: 2,
    // where: {"metadata_field": "is_equal_to_this"}, // optional filter
    // whereDocument: {"$contains":"search_string"} // optional filter
  });

  console.log("🚀 ~ results", results);
};

async function run() {
  // This works
  await fetchDocumentsWithChromaDBClient();

  // This does not work
  await fetchDocumentsWithLangchain();
}

run();

results in the following output:

🚀 ~ collections [
  {
    name: 'langchain-b636db29-3d63-47c0-a822-e345b88035df',
    id: '687b22a5-81e3-4d77-9b4b-a4e865445bac',
    metadata: null,
    tenant: 'default_tenant',
    database: 'default_database'
  },
  {
    name: 'langchain-b532f2ca-54ce-4d29-ba6f-95b9356a42ad',
    id: '868be5d2-9122-4418-bc41-4e6c199e48bd',
    metadata: null,
    tenant: 'default_tenant',
    database: 'default_database'
  },
  {
    name: 'pyansys',
    id: 'cb1b5a68-ce47-431d-b2db-0210dfedcc76',
    metadata: null,
    tenant: 'default_tenant',
    database: 'default_database'
  },
  {
    name: 'langchain-4ca4e812-cb89-4ef0-935b-27993cf98d4b',
    id: 'd96ba9df-44f0-453d-98a3-101ae772db61',
    metadata: null,
    tenant: 'default_tenant',
    database: 'default_database'
  }
]
{
  "name": "pyansys",
  "id": "cb1b5a68-ce47-431d-b2db-0210dfedcc76",
  "metadata": null,
  "api": {
    "basePath": "http://localhost:8010",
    "configuration": {
      "basePath": "http://localhost:8010"
    },
    "options": {}
  },
  "embeddingFunction": {
    "model": "Xenova/all-MiniLM-L6-v2",
    "revision": "main",
    "quantized": false,
    "progress_callback": null
  }
}
Retrieved Documents: []
🚀 ~ results {
  ids: [
    [
      'ad3b38b2-17a6-11ef-9997-f2d13e4e5577',
      '3349b7e6-17aa-11ef-9997-f2d13e4e5577'
    ]
  ],
  distances: [ [ 1.4828091937878527, 1.4828091937878527 ] ],
  metadatas: [ [ [Object], [Object] ] ],
  embeddings: null,
  documents: [
    [
      '{"page_content": "PyDPF-Core \\u2014 XXX-Core\\n\\nprint(model)  \\nHere is how you plot XX results:  \\[] OMITTED Languages", "XXX", "XX-X"]}, "type": "Document"}'
    ]
  ],
  uris: null,
  data: null,
  included: [ 'metadatas', 'documents', 'distances' ]
}

Note that using the ChromaClient directly results in documents being returned whereas the Langchain Chroma class returns zero results.

The following python code works as expected as well:

from langchain_chroma import Chroma
import chromadb
from chromadb.config import Settings
from IPython.display import Markdown
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

llm = ChatOpenAI(model='gpt-4-turbo', temperature=0, verbose=True)
embeddings = OpenAIEmbeddings()
query = "What library about?"
client = chromadb.HttpClient(host="localhost", port=8010, settings=Settings(allow_reset=True))
persistent_client = chromadb.PersistentClient()
db = Chroma(client=persistent_client, collection_name="pyansys", embedding_function=embeddings)

# Retrieve and generate using the relevant snippets of the blog.
retriever2 = db.as_retriever()
# prompt = hub.pull("rlm/rag-prompt")
prompt = PromptTemplate.from_template("""
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Try to keep the answer concise. If the answer has code, ensure you provide the code in a markdown code block format. Add the source URLs provided in the context at the end of the answer. 
Question: {question} 
Context: {context} 
Answer:
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain2 = (
    {"context": retriever2 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

display(rag_chain2.invoke(query))

Error Message and Stack Trace (if applicable)

None

Description

I'm just trying to get a basic RAG example working with the chroma db vectorstore

System Info

"yarn info langchain"

  "dependencies": {
    "@langchain/community": "^0.2.1",
    "@langchain/openai": "^0.0.33",
    "@types/node": "^20.12.12",
    "chromadb": "1.7.3-beta5",
    "chromadb-default-embed": "^2.13.2",
    "langchain": "^0.2.0",
    "typescript": "^5.4.5"
  }

langchain-ts-project 🚀 node --version
v21.7.1

Mac

dosubot[bot] commented 1 month ago

To resolve the issue where the LangChain Chroma class does not return any results while the direct Chroma client works correctly for similarity search, ensure the following:

  1. Correct Collection Name: Make sure the collection name used in the Chroma class matches the one used in the direct Chroma client. In your example, the collection name is "pyansys".

  2. Check Embeddings: Verify that the embeddings are correctly generated and used in the similarity search. The OpenAIEmbeddings class should be properly instantiated.

Here is an updated version of your code with these considerations:

import { ChromaClient, DefaultEmbeddingFunction } from "chromadb";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

const url = "http://localhost:8010";
const query = "What is this library about?";

// THIS DOES NOT WORK
async function fetchDocumentsWithLangchain() {
  try {
    const vectorStore = new Chroma(new OpenAIEmbeddings(), {
      url,
      collectionName: "pyansys", // Ensure the collection name matches
    });

    // Perform a similarity search
    const response = await vectorStore.similaritySearch(query, 2);

    // Log and send the response
    console.log("Retrieved Documents:", response);
  } catch (error) {
    console.error("Error fetching Langchain vectors:", error);
  }
}

// THIS WORKS
const fetchDocumentsWithChromaDBClient = async () => {
  const client = new ChromaClient({
    path: url,
  });

  const collections = await client.listCollections();
  console.log("🚀 ~ collections", collections);

  const ef = new DefaultEmbeddingFunction();

  const collection = await client.getCollection({
    name: "pyansys",
    embeddingFunction: ef,
  });

  console.log(JSON.stringify(collection, null, 2));

  const results = await collection.query({
    queryTexts: [query],
    nResults: 2,
    // where: {"metadata_field": "is_equal_to_this"}, // optional filter
    // whereDocument: {"$contains":"search_string"} // optional filter
  });

  console.log("🚀 ~ results", results);
};

async function run() {
  // This works
  await fetchDocumentsWithChromaDBClient();

  // This does not work
  await fetchDocumentsWithLangchain();
}

run();

Additional Debugging Steps

  1. Verify Embeddings: Ensure that the OpenAIEmbeddings class is correctly generating embeddings. You can add logging to check the embeddings generated for the query.

  2. Check Collection Initialization: Ensure that the collection is correctly initialized in the Chroma class. You can add logging to verify the collection details.

  3. Mocking and Testing: Use the following test case to mock the query method and verify that the similaritySearchVectorWithScore method returns the expected results:

import { jest, test, expect, describe, beforeEach } from "@jest/globals";
import { type Collection } from "chromadb";
import { FakeEmbeddings } from "@langchain/core/utils/testing";
import { Chroma } from "../chroma.js";

const mockCollection = {
  query: jest.fn<Collection["query"]>().mockResolvedValue({
    ids: [["0", "1", "2", "3", "4"]],
    distances: [[0.1, 0.2, 0.3, 0.4, 0.5]],
    documents: [
      ["Document 1", "Document 2", "Document 3", "Document 4", "Document 5"],
    ],
    metadatas: [[{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }]],
  } as any),
} as any;

const mockClient = {
  getOrCreateCollection: jest.fn<any>().mockResolvedValue(mockCollection),
} as any;

describe("Chroma", () => {
  beforeEach(() => {
    jest.clearAllMocks();
  });

  test("should perform similarity search and return results", async () => {
    const args = { collectionName: "pyansys" };
    const query = [1, 2];
    const expectedResultCount = 5;
    mockCollection.query = jest.fn<Collection["query"]>().mockResolvedValue({
      ids: [["0", "1", "2", "3", "4"]],
      distances: [[0.1, 0.2, 0.3, 0.4, 0.5]],
      documents: [
        ["Document 1", "Document 2", "Document 3", "Document 4", "Document 5"],
      ],
      metadatas: [[{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }]],
    } as any);

    const chroma = new Chroma(new FakeEmbeddings(), args);
    chroma.collection = mockCollection;

    const results = await chroma.similaritySearchVectorWithScore(
      query,
      expectedResultCount
    );

    expect(mockCollection.query).toHaveBeenCalledWith({
      queryEmbeddings: query,
      nResults: expectedResultCount,
      where: {},
    });
    expect(results).toHaveLength(5);
  });
});

By following these steps and ensuring that the collection name and embeddings are correctly configured, you should be able to resolve the issue with the LangChain Chroma class not returning any results.

#### Details

**Sources**

- [Stumped when using `PineconeStore.fromExstingIndex()` to create and query a vectorstore. No error, no results. Pinecone client query works fine.](https://github.com/langchain-ai/langchainjs/discussions/5074)
- [Allow ChromaClient to be initialized outside Chroma Vector Store](https://github.com/langchain-ai/langchainjs/issues/211)
- [libs/langchain-community/src/vectorstores/tests/chroma.test.ts](https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-community/src/vectorstores/tests/chroma.test.ts)
- [libs/langchain-community/src/vectorstores/tests/chroma.int.test.ts](https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-community/src/vectorstores/tests/chroma.int.test.ts)
- [examples/src/indexes/vector_stores/chroma/search.ts](https://github.com/langchain-ai/langchainjs/blob/main/examples/src/indexes/vector_stores/chroma/search.ts)
- [langchain/src/retrievers/self_query/chroma.ts](https://github.com/langchain-ai/langchainjs/blob/main/langchain/src/retrievers/self_query/chroma.ts)
- [libs/langchain-community/src/vectorstores/chroma.ts](https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-community/src/vectorstores/chroma.ts)
- [examples/src/retrievers/chroma_self_query.ts](https://github.com/langchain-ai/langchainjs/blob/main/examples/src/retrievers/chroma_self_query.ts)
**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

jeasonnow commented 1 month ago

Did you pass the model field for embedding? If not, please try again:

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-large",
});
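For reference, the output sizes of these models differ. The values below are assumed from OpenAI's published documentation, not from this repo; if the collection was seeded with one model and queried with another of a different size, Chroma cannot match the vectors:

```typescript
// Published output dimensions of common OpenAI embedding models
// (assumed values; verify against OpenAI's docs for your account).
const EMBEDDING_DIMS: Record<string, number> = {
  "text-embedding-ada-002": 1536,
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
};

// A query embedding is only comparable to the stored vectors if the
// seeding model and the query model produce the same dimension.
function dimensionsMatch(seedModel: string, queryModel: string): boolean {
  return EMBEDDING_DIMS[seedModel] === EMBEDDING_DIMS[queryModel];
}

console.log(dimensionsMatch("text-embedding-ada-002", "text-embedding-3-small")); // true
console.log(dimensionsMatch("text-embedding-ada-002", "text-embedding-3-large")); // false
```

So switching the query side to `text-embedding-3-large` only helps if the collection was also seeded with a 3072-dimension model.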
justinlevi commented 1 month ago
(screenshot omitted)

Same result. Querying the chroma db vectorstore collection directly with the chromadb.ChromaClient works as expected. Querying the vectorstore through the langchain Chroma class via the similarity search returns zero results.

// This still returns zero results

const vectorStore = new Chroma(
  new OpenAIEmbeddings({ model: "text-embedding-3-large" }),
  {
    url,
  }
);

// Perform a similarity search
const response = await vectorStore.similaritySearch(query, 2);

The chroma collection was created and populated from a python script and now I'm trying to access it via typescript.

In my colab notebook, the following Python query works totally fine as well:

from langchain_chroma import Chroma
import chromadb
from chromadb.config import Settings
from IPython.display import Markdown
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

llm = ChatOpenAI(model='gpt-4-turbo', temperature=0, verbose=True)
embeddings = OpenAIEmbeddings()
query = "What library about?"
client = chromadb.HttpClient(host="localhost", port=8010, settings=Settings(allow_reset=True))
persistent_client = chromadb.PersistentClient()
db = Chroma(client=persistent_client, collection_name="pyansys", embedding_function=embeddings)

# Retrieve and generate using the relevant snippets of the blog.
retriever2 = db.as_retriever()
# prompt = hub.pull("rlm/rag-prompt")
prompt = PromptTemplate.from_template("""
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Try to keep the answer concise. If the answer has code, ensure you provide the code in a markdown code block format. Add the source URLs provided in the context at the end of the answer. 
Question: {question} 
Context: {context} 
Answer:
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain2 = (
    {"context": retriever2 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

display(rag_chain2.invoke(query))
Why would this work in Python but not in TypeScript? Trying to understand what's different.

Here is my package.json. I needed to upgrade to "chromadb": "1.7.3-beta5" because I was getting a dimension mismatch with older versions. Could that be the issue here somehow?

  "dependencies": {
    "@langchain/community": "^0.2.1",
    "@langchain/openai": "^0.0.33",
    "@types/node": "^20.12.12",
    "chromadb": "1.7.3-beta5",
    "chromadb-default-embed": "^2.13.2",
    "langchain": "^0.2.0",
    "typescript": "^5.4.5"
  },

jeasonnow commented 1 month ago

1.7.3 release

https://github.com/chroma-core/chroma/blob/e0c5d441d7f14971eeaedf89a6b929986f60d327/clients/js/src/ChromaClient.ts#L183-L208

latest release

https://github.com/chroma-core/chroma/blob/f8bc50a46162390d466281528d085c7303c0d219/clients/js/src/ChromaClient.ts#L213-L241

I looked at the code for the chroma@1.7.3 release, and your guess might be right. In older versions of chroma, a DefaultEmbeddingFunction was not passed by default, which could prevent embedding queries from working properly.

https://github.com/langchain-ai/langchainjs/blob/96324b5a2f473dde7853522884369805b9aa848a/libs/langchain-community/src/vectorstores/chroma.ts#L115-L122

Unfortunately, the Chroma implementation in langchain-community seems to be based on the latest version. When initializing the client, there is no configuration option for embeddingFunction. Perhaps I can add a configuration option to solve this problem.

jeasonnow commented 1 month ago

Perhaps my guess was incorrect. After carefully examining all the implementations, the current code is actually consistent with LangChain Python, so I don't need to add an embedding function; in chromadb, the embedding function only takes effect when using queryTexts. However, I found an issue with your code:

const vectorStore = new Chroma(new OpenAIEmbeddings(), {
  url,
  collectionName: "pyansys",
});

You should specify the collection you are querying. If you don't specify it, LangChain automatically generates a brand-new (empty) collection and queries that, which is why you get zero results. Your implementation in Python is correct:

llm = ChatOpenAI(model='gpt-4-turbo', temperature=0, verbose=True)
embeddings = OpenAIEmbeddings()
query = "What library about?"
client = chromadb.HttpClient(host="localhost", port=8010, settings=Settings(allow_reset=True))
persistent_client = chromadb.PersistentClient()
db = Chroma(client=persistent_client, collection_name="pyansys", embedding_function=embeddings)
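The fallback behavior can be sketched as follows. This is a minimal illustration, not the actual @langchain/community source, but it mirrors the `langchain-<uuid>` collection names visible in the `listCollections()` output above:

```typescript
import { randomUUID } from "node:crypto";

// When no collectionName is supplied, a fresh random name is generated,
// so the similarity search runs against a brand-new, empty collection
// and returns zero results.
function resolveCollectionName(collectionName?: string): string {
  return collectionName ?? `langchain-${randomUUID()}`;
}

console.log(resolveCollectionName("pyansys")); // always "pyansys"
console.log(resolveCollectionName()); // e.g. "langchain-b636db29-..."
```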
justinlevi commented 1 month ago

Yes, I also noticed that bug in my code as well. After fixing that, I created a simplified repo to try and more clearly identify the issue.

https://github.com/justinlevi/ts-langchain-chromadb

After setting this up, I noticed that each time I seeded the database, a new chroma folder was created on my host machine. It seems that something I was doing was creating two client connections, one to the remote server and a new local one?

This version of my python code now seems to work as expected:

client = chromadb.HttpClient(host="localhost", port=8000, settings=Settings(allow_reset=True))
collection = client.get_or_create_collection("demo")
embeddings_function = OpenAIEmbeddings()

Chroma().from_documents(
    docs, embeddings_function, client=client, collection_name="demo"
)

This just removes the persistent_client.

I can now run my typescript code with the latest versions of langchain and get documents returned from the remote chromadb connection.

What is very strange though is that now my chroma db client fails with the following error:

🚀 ~ results { error: 'InvalidDimension' }

I was not getting this error with "chromadb": "1.7.3-beta5", though. My goal was to get this working with LangChain JS, so I'm going to let this go as a mystery for now. The remaining issue also seems more related to the chromadb Node.js client library at this point, so no need to dive further in this issue.
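For anyone hitting the same error: a hypothetical guard like the one below (checkDimension is not a chromadb API, just an illustration) shows what 'InvalidDimension' means — the query embedding's length differs from the dimension the collection was created with, e.g. a 384-dim all-MiniLM-L6-v2 vector queried against a collection seeded with 1536-dim OpenAI embeddings:

```typescript
// Hypothetical pre-flight check mimicking Chroma's dimension validation.
function checkDimension(queryEmbedding: number[], collectionDim: number): void {
  if (queryEmbedding.length !== collectionDim) {
    throw new Error(
      `InvalidDimension: query has ${queryEmbedding.length} dimensions, ` +
        `collection expects ${collectionDim}`
    );
  }
}

checkDimension(new Array(1536).fill(0), 1536); // ok: dimensions match
// checkDimension(new Array(384).fill(0), 1536); // would throw InvalidDimension
```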

@jeasonnow appreciate you diving into this with me. Hopefully that github repo will help someone else in the future

jacoblee93 commented 1 month ago

Hey apologies for losing track of this!

There were a few issues with the Chroma JS library in the past, and as a result we set the peer dependency to * until we could add a separately installable integration package to avoid breaking the broader @langchain/community package. This is likely to cause issues as Chroma makes breaking changes.

Will prioritize that split package and update this issue when that's done and ready!
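In the meantime, one possible workaround (an assumption on my part, not an official recommendation) is to pin chromadb to an exact known-good version in your own package.json, so the `*` peer dependency can't float to a release with breaking changes:

```json
{
  "dependencies": {
    "chromadb": "1.7.3"
  }
}
```

The version shown is illustrative; pin whichever release your collection was created and tested against.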