justinlevi opened this issue 1 month ago
To resolve the issue where the LangChain Chroma class does not return any results while the direct Chroma client works correctly for similarity search, ensure the following:
Correct Collection Name: Make sure the collection name used in the Chroma class matches the one used in the direct Chroma client. In your example, the collection name is "pyansys".
Check Embeddings: Verify that the embeddings are correctly generated and used in the similarity search. The OpenAIEmbeddings class should be properly instantiated.
Here is an updated version of your code with these considerations:
import { ChromaClient, DefaultEmbeddingFunction } from "chromadb";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Chroma } from "@langchain/community/vectorstores/chroma";

const url = "http://localhost:8010";
const query = "What is this library about?";

// THIS DOES NOT WORK
async function fetchDocumentsWithLangchain() {
  try {
    const vectorStore = new Chroma(new OpenAIEmbeddings(), {
      url,
      collectionName: "pyansys", // Ensure the collection name matches
    });

    // Perform a similarity search
    const response = await vectorStore.similaritySearch(query, 2);

    // Log the response
    console.log("Retrieved Documents:", response);
  } catch (error) {
    console.error("Error fetching Langchain vectors:", error);
  }
}

// THIS WORKS
const fetchDocumentsWithChromaDBClient = async () => {
  const client = new ChromaClient({
    path: url,
  });

  const collections = await client.listCollections();
  console.log("🚀 ~ collections", collections);

  const ef = new DefaultEmbeddingFunction();
  const collection = await client.getCollection({
    name: "pyansys",
    embeddingFunction: ef,
  });
  console.log(JSON.stringify(collection, null, 2));

  const results = await collection.query({
    queryTexts: [query],
    nResults: 2,
    // where: {"metadata_field": "is_equal_to_this"}, // optional filter
    // whereDocument: {"$contains":"search_string"} // optional filter
  });
  console.log("🚀 ~ results", results);
};

async function run() {
  // This works
  await fetchDocumentsWithChromaDBClient();
  // This does not work
  await fetchDocumentsWithLangchain();
}

run();
Verify Embeddings: Ensure that the OpenAIEmbeddings class is correctly generating embeddings. You can add logging to check the embeddings generated for the query.
Check Collection Initialization: Ensure that the collection is correctly initialized in the Chroma class. You can add logging to verify the collection details.
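As a concrete sketch of that check, the names returned by client.listCollections() (as in the working raw-client snippet above) can be compared against the name the vector store is configured with. Note that collectionExists is a hypothetical helper written for illustration; it is not part of either library:

```typescript
// Hypothetical helper: confirm the configured collection already exists on the
// Chroma server before querying, using the names from client.listCollections().
function collectionExists(existingNames: string[], configured: string): boolean {
  return existingNames.includes(configured);
}

// "pyansys" was seeded, so querying it can return documents.
console.log(collectionExists(["pyansys"], "pyansys")); // true
// A mismatched name points at a collection that does not exist yet,
// so a query against it returns zero results.
console.log(collectionExists(["pyansys"], "pyansys-v2")); // false
```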
Mocking and Testing: Use the following test case to mock the query method and verify that the similaritySearchVectorWithScore method returns the expected results:
import { jest, describe, beforeEach, test, expect } from "@jest/globals";
import { type Collection } from "chromadb";
import { FakeEmbeddings } from "@langchain/core/utils/testing";
import { Chroma } from "../chroma.js";

const mockCollection = {
  query: jest.fn<Collection["query"]>().mockResolvedValue({
    ids: [["0", "1", "2", "3", "4"]],
    distances: [[0.1, 0.2, 0.3, 0.4, 0.5]],
    documents: [
      ["Document 1", "Document 2", "Document 3", "Document 4", "Document 5"],
    ],
    metadatas: [[{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }]],
  } as any),
} as any;

const mockClient = {
  getOrCreateCollection: jest.fn<any>().mockResolvedValue(mockCollection),
} as any;

describe("Chroma", () => {
  beforeEach(() => {
    jest.clearAllMocks();
  });

  test("should perform similarity search and return results", async () => {
    const args = { collectionName: "pyansys" };
    const query = [1, 2];
    const expectedResultCount = 5;

    mockCollection.query = jest.fn<Collection["query"]>().mockResolvedValue({
      ids: [["0", "1", "2", "3", "4"]],
      distances: [[0.1, 0.2, 0.3, 0.4, 0.5]],
      documents: [
        ["Document 1", "Document 2", "Document 3", "Document 4", "Document 5"],
      ],
      metadatas: [[{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }]],
    } as any);

    const chroma = new Chroma(new FakeEmbeddings(), args);
    chroma.collection = mockCollection;

    const results = await chroma.similaritySearchVectorWithScore(
      query,
      expectedResultCount
    );

    expect(mockCollection.query).toHaveBeenCalledWith({
      queryEmbeddings: query,
      nResults: expectedResultCount,
      where: {},
    });
    expect(results).toHaveLength(5);
  });
});
By following these steps and ensuring that the collection name and embeddings are correctly configured, you should be able to resolve the issue with the LangChain Chroma class not returning any results.
Did you pass the model field for the embeddings? If not, please try again:
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-large",
});
Same result. Querying the chroma db vectorstore collection directly with the chromadb.ChromaClient works as expected. Querying the vectorstore through the langchain Chroma class via the similarity search returns zero results.
// This still returns zero results
const vectorStore = new Chroma(
  new OpenAIEmbeddings({ model: "text-embedding-3-large" }),
  {
    url,
  }
);

// Perform a similarity search
const response = await vectorStore.similaritySearch(query, 2);
The Chroma collection was created and populated from a Python script, and now I'm trying to access it via TypeScript.
In my colab notebook, the following Python query works totally fine as well:
from langchain_chroma import Chroma
import chromadb
from chromadb.config import Settings
from IPython.display import Markdown
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

llm = ChatOpenAI(model='gpt-4-turbo', temperature=0, verbose=True)
embeddings = OpenAIEmbeddings()
query = "What library about?"

client = chromadb.HttpClient(host="localhost", port=8010, settings=Settings(allow_reset=True))
persistent_client = chromadb.PersistentClient()
db = Chroma(client=persistent_client, collection_name="pyansys", embedding_function=embeddings)

# Retrieve and generate using the relevant snippets of the blog.
retriever2 = db.as_retriever()

# prompt = hub.pull("rlm/rag-prompt")
prompt = PromptTemplate.from_template("""
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Try to keep the answer concise. If the answer has code, ensure you provide the code in a markdown code block format. Add the source URLs provided in the context at the end of the answer.
Question: {question}
Context: {context}
Answer:
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain2 = (
    {"context": retriever2 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

display(rag_chain2.invoke(query))
"dependencies": {
"@langchain/community": "^0.2.1",
"@langchain/openai": "^0.0.33",
"@types/node": "^20.12.12",
"chromadb": "1.7.3-beta5",
"chromadb-default-embed": "^2.13.2",
"langchain": "^0.2.0",
"typescript": "^5.4.5"
},
Why would this work in Python but not in Typescript? Trying to understand what's different.
Here is my package.json. I needed to upgrade "chromadb" to "1.7.3-beta5" as I was getting a dimensions mismatch with older versions. Could that be the issue here somehow?
I looked at the code for the chroma@1.7.3 release, and your guess might be right. In older versions of chromadb, DefaultEmbeddingFunction was not passed by default, which could prevent embedding queries from working properly.
Unfortunately, the Chroma implementation in langchain-community seems to be based on the latest version. When initializing the client, there is no configuration option for embeddingFunction. Perhaps I can add a configuration option to solve this problem.
Perhaps my guess was incorrect. After carefully examining all the implementations, the current code is actually consistent with LangChain Python, so I don't need to add an embedding function. In chromadb, it only takes effect when using queryTexts. However, I found an issue with your code:
const vectorStore = new Chroma(new OpenAIEmbeddings(), {
  url,
  collectionName: 'pyansys'
});
Maybe you should specify the collection you are querying. If you don't specify it, langchain will automatically generate a brand-new collection and run the query against it.
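To illustrate the failure mode, here is a sketch of that fallback behavior. The generated-name format below is an assumption made for illustration; the real default-naming logic lives inside @langchain/community:

```typescript
// Illustrative only: when no collectionName is configured, the store falls
// back to a generated name, so the query runs against a fresh, empty
// collection and returns zero results.
function resolveCollectionName(configured?: string): string {
  // Hypothetical fallback mimicking a generated "langchain-..." name
  return configured ?? `langchain-${Math.random().toString(36).slice(2, 10)}`;
}

console.log(resolveCollectionName("pyansys")); // "pyansys" -> hits the seeded data
console.log(resolveCollectionName()); // e.g. "langchain-a1b2c3d4" -> empty collection
```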
Your implementation in Python is correct:
llm = ChatOpenAI(model='gpt-4-turbo', temperature=0, verbose=True)
embeddings = OpenAIEmbeddings()
query = "What library about?"
client = chromadb.HttpClient(host="localhost", port=8010, settings=Settings(allow_reset=True))
persistent_client = chromadb.PersistentClient()
db = Chroma(client=persistent_client, collection_name="pyansys", embedding_function=embeddings)
Yes, I also noticed that bug in my code as well. After fixing that, I created a simplified repo to try and more clearly identify the issue.
https://github.com/justinlevi/ts-langchain-chromadb
After setting this up, I noticed that each time I seeded the database, a chroma folder was created on my host machine. It seems that something I was doing was creating two client connections: one to the remote server and a new local one.
This version of my python code now seems to work as expected:
client = chromadb.HttpClient(host="localhost", port=8000, settings=Settings(allow_reset=True))
collection = client.get_or_create_collection("demo")
embeddings_function = OpenAIEmbeddings()

Chroma().from_documents(
    docs, embeddings_function, client=client, collection_name="demo"
)
This just removes the persistent_client.
I can now run my typescript code with the latest versions of langchain and get documents returned from the remote chromadb connection.
What is very strange though is that now my chroma db client fails with the following error:
🚀 ~ results { error: 'InvalidDimension' }
I was not getting this error with "chromadb": "1.7.3-beta5". My goal was to get this working with LangChain JS, though, so I'm going to let this go as a mystery for now. This remaining issue also seems more related to the chromadb Node.js client library at this point, so no need to dive in further in this issue.
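One plausible explanation for the InvalidDimension error (an assumption, not confirmed in the thread): the raw-client snippet queries via DefaultEmbeddingFunction, whose local all-MiniLM-L6-v2 model produces 384-dimensional vectors, while the collection was seeded with OpenAI embeddings (1536 dimensions for text-embedding-ada-002, 3072 for text-embedding-3-large). A sketch of the check Chroma effectively applies:

```typescript
// Sketch of the server-side constraint behind InvalidDimension: query vectors
// must match the dimensionality of the vectors already stored in the collection.
// The dimensions below are illustrative of the models discussed in this thread.
const STORED_DIM = 3072; // text-embedding-3-large (assumed seeding model)
const QUERY_DIM = 384;   // all-MiniLM-L6-v2 via DefaultEmbeddingFunction

function checkDimensions(stored: number, query: number): string | null {
  return stored === query ? null : "InvalidDimension";
}

console.log(checkDimensions(STORED_DIM, QUERY_DIM)); // "InvalidDimension"
console.log(checkDimensions(STORED_DIM, 3072)); // null: dimensions match
```

If this is the cause, querying through the raw client with an OpenAI embedding function (instead of the default) should make the error go away.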
@jeasonnow, appreciate you diving into this with me. Hopefully that GitHub repo will help someone else in the future.
Hey, apologies for losing track of this!
There were a few issues with the Chroma JS library in the past, and as a result we set the peer dependency to * until we could add a separately installable integration package to avoid breaking the broader @langchain/community package. This is likely to cause issues as Chroma makes breaking changes.
Will prioritize that split package and update this issue when that's done and ready!
Example Code
Here is my code:
results in the following output:
Note that using the ChromaClient directly results in documents being returned whereas the Langchain Chroma class returns zero results.
The following python code works as expected as well:
Error Message and Stack Trace (if applicable)
None
Description
I'm just trying to get a basic RAG example working with the chroma db vectorstore
System Info
"yarn info langchain" "dependencies": { "@langchain/community": "^0.2.1", "@langchain/openai": "^0.0.33", "@types/node": "^20.12.12", "chromadb": "1.7.3-beta5", "chromadb-default-embed": "^2.13.2", "langchain": "^0.2.0", "typescript": "^5.4.5" },
langchain-ts-project 🚀 node --version
v21.7.1
Mac