Closed Jtewen closed 3 weeks ago
Hey there @Jtewen! Great to see you diving into another interesting challenge with LangChain. Let's get to the bottom of this one too!
Thank you for bringing this to our attention. You're correct that the ScoreThresholdRetriever in LangChain.js currently assumes that higher scores indicate higher similarity, which is incompatible with the HNSWLib's scoring method where 0 indicates the most similar and 1 the most dissimilar.
A similar issue was previously encountered with the Prisma VectorStore, as documented in this issue. The proposed solution was to introduce a new configuration parameter that specifies the type of scoring system used. This would allow the ScoreThresholdRetriever to handle both cases, using the scoreType parameter to determine how to interpret the scores returned by the vector store.
Here's a rough idea of how the changes might look:
export const cacheRetriever = async () => ScoreThresholdRetriever
  .fromVectorStore(await cacheStore(), {
    minSimilarityScore: 0.9,
    kIncrement: 2,
    maxK: 100,
    scoreType: 'distance' // 'distance' or 'similarity'
  });
In the getRelevantDocuments method, the interpretation of the scores would then depend on the scoreType:
if (this.scoreType === 'distance') {
  // For distance metrics, lower scores are better
  return scores.filter(score => score <= this.minSimilarityScore);
} else {
  // For similarity scores, higher scores are better
  return scores.filter(score => score >= this.minSimilarityScore);
}
Please note that these changes would need to be implemented in the LangChain.js codebase. As a workaround until this issue is addressed, you could manually invert the scores returned by HNSWLib before passing them to the ScoreThresholdRetriever. This would align the scores with the ScoreThresholdRetriever's expectation that higher scores indicate higher similarity.
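The inversion workaround mentioned above could look roughly like the sketch below. It is an illustration only: `distancesToSimilarities` and the `Scored` type are hypothetical names, not part of LangChain.js, and the `1 - distance` mapping assumes distances are normalized to the 0 to 1 range described in this issue.

```typescript
// Hypothetical helper, not part of LangChain.js.
type Scored<T> = [T, number];

// Convert distances (0 = most similar) into similarities (1 = most similar).
// Assumption: distances lie in [0, 1], as described for HNSWLib in this issue.
function distancesToSimilarities<T>(results: Scored<T>[]): Scored<T>[] {
  return results.map(([doc, distance]): Scored<T> => [doc, 1 - distance]);
}

// Made-up sample results in the [document, distance] shape.
const raw: Scored<string>[] = [
  ["near match", 0.05],
  ["weak match", 0.6],
];

const inverted = distancesToSimilarities(raw);

// A "higher is better" threshold now keeps the close match and drops the weak one.
const kept = inverted.filter(([, similarity]) => similarity >= 0.9);
console.log(kept.map(([doc]) => doc));
```

How the inverted scores are fed back into the retriever depends on how the store is wrapped, which this sketch does not cover.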
I hope this helps! If you have any further questions or need additional clarification, please don't hesitate to ask.
I've encountered a similar issue with Prisma or PGVector when retrieving relevant documents based on a query and a maximum similarity distance threshold.
The core of the problem lies in the way the original ScoreThresholdRetriever implementation compares the similarity score, which doesn't align with how databases that return vector distances operate.
Original:
async getRelevantDocuments(query) {
  let currentK = 0;
  let filteredResults = [];
  do {
    currentK += this.kIncrement;
    const results = await this.vectorStore.similaritySearchWithScore(query, currentK, this.filter);
    // COMPARISON: score >= this.minSimilarityScore
    filteredResults = results.filter(([, score]) => score >= this.minSimilarityScore);
  } while (filteredResults.length >= currentK && currentK < this.maxK);
  return filteredResults.map((documents) => documents[0]).slice(0, this.maxK);
}
Patched version:
async getRelevantDocuments(query: string): Promise<Document[]> {
  let currentK = 0;
  let filteredResults: [Document, number][] = [];
  do {
    currentK += this.kIncrement;
    const results = await this.vectorStore.similaritySearchWithScore(query, currentK, this.filter);
    filteredResults = results.filter(([, distance]) => {
      // COMPARISON: distance <= this.maxSimilarityDistance
      return distance <= this.maxSimilarityDistance;
    });
  } while (filteredResults.length >= currentK && currentK < this.maxK);
  return filteredResults.map((documents) => documents[0]).slice(0, this.maxK);
}
In the original implementation, a higher score indicates more relevance, whereas, in a database that returns vector distances (see Prisma implementation), a lower distance indicates more relevance. This inversion means that the original implementation is unsuitable for such use cases.
To address this, I've implemented a custom function to handle the filtering appropriately. It keeps only the documents whose distance is at or below the threshold and discards the rest, which matches how vector distances work.
I wonder if I am missing something that required this patch. If anyone else is experiencing the same problem or has any insights, please let us know.
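For anyone comparing the two versions, here is a minimal, self-contained sketch of how both comparisons could share one loop. Every name in it (`ScoreType`, `passesThreshold`, `retrieveWithinThreshold`, `fakeSearch`) is hypothetical and chosen for illustration; it is not the LangChain.js implementation, and the in-memory `fakeIndex` only stands in for a real vector store.

```typescript
// Hypothetical names throughout; this is not the LangChain.js implementation.
type ScoreType = "similarity" | "distance";
type Scored<T> = [T, number];
type SearchFn<T> = (query: string, k: number) => Promise<Scored<T>[]>;

interface ThresholdOptions {
  threshold: number;
  scoreType: ScoreType;
  kIncrement: number;
  maxK: number;
}

// Lower is better for distances, higher is better for similarities.
function passesThreshold(score: number, threshold: number, scoreType: ScoreType): boolean {
  return scoreType === "distance" ? score <= threshold : score >= threshold;
}

async function retrieveWithinThreshold<T>(
  search: SearchFn<T>,
  query: string,
  options: ThresholdOptions
): Promise<T[]> {
  const { threshold, scoreType, kIncrement, maxK } = options;
  let currentK = 0;
  let filtered: Scored<T>[] = [];
  do {
    currentK += kIncrement;
    const results = await search(query, currentK);
    filtered = results.filter(([, score]) => passesThreshold(score, threshold, scoreType));
    // Keep widening the search while every requested result passed the threshold.
  } while (filtered.length >= currentK && currentK < maxK);
  return filtered.map(([doc]) => doc).slice(0, maxK);
}

// In-memory stand-in for a store that returns distances, closest first.
const fakeIndex: Scored<string>[] = [
  ["a", 0.02],
  ["b", 0.08],
  ["c", 0.09],
  ["d", 0.4],
];
const fakeSearch: SearchFn<string> = async (_query, k) => fakeIndex.slice(0, k);
```

With this sample data, calling `retrieveWithinThreshold(fakeSearch, "hello", { threshold: 0.1, scoreType: "distance", kIncrement: 2, maxK: 100 })` resolves to `["a", "b", "c"]`, while the same call with `scoreType: "similarity"` resolves to an empty array, which mirrors the mismatch discussed in this thread.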
Checked other resources
Example Code
// path-to-vectorstore and some-embedder are placeholders for the real path and embeddings instance.
const cacheStore = () => HNSWLib.load(path-to-vectorstore, some-embedder);
export const cacheRetriever = async () => ScoreThresholdRetriever
  .fromVectorStore(await cacheStore(), { minSimilarityScore: 0.9, kIncrement: 2, maxK: 100 });
console.log(await (await cacheRetriever()).getRelevantDocuments("hello"));
Error Message and Stack Trace (if applicable)
No response
Description
HNSWLib similarity scores are inverted, where 0 is most similar and 1 is most dissimilar. This doesn't work with the ScoreThresholdRetriever as it assumes that higher scores are more similar.
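To make the mismatch concrete, here is a small sketch using made-up numbers. The distances are illustrative only, not real HNSWLib output, and treating `1 - minSimilarityScore` as the equivalent distance cutoff is an assumption that only holds if distances stay within 0 to 1.

```typescript
// Illustrative distances only (0 = most similar); not real HNSWLib output.
const sampleDistances = [0.03, 0.12, 0.55, 0.93];
const minSimilarityScore = 0.9;

// The comparison the retriever applies: higher scores are assumed to be better.
const keptAsSimilarity = sampleDistances.filter((s) => s >= minSimilarityScore);

// What a caller working with distances would expect instead.
// Assumption: a similarity of 0.9 corresponds to a distance of at most 1 - 0.9.
const maxDistance = 1 - minSimilarityScore;
const keptAsDistance = sampleDistances.filter((s) => s <= maxDistance);

console.log(keptAsSimilarity); // only the least similar entry survives
console.log(keptAsDistance); // only the closest entry survives
```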
System Info
macOS, Node v21.1.0, LangChain 0.1.24