langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

HNSWlib similarity score calculation is incompatible with ScoreThresholdRetriever #4599

Closed · Jtewen closed this issue 3 weeks ago

Jtewen commented 7 months ago

Example Code

const cacheStore = async () => HNSWLib.load("path-to-vectorstore", someEmbedder);

export const cacheRetriever = async () =>
  ScoreThresholdRetriever.fromVectorStore(await cacheStore(), {
    minSimilarityScore: 0.9,
    kIncrement: 2,
    maxK: 100,
  });

console.log(await (await cacheRetriever()).getRelevantDocuments("hello"));

Error Message and Stack Trace (if applicable)

No response

Description

HNSWLib similarity scores are actually distances, so they are inverted: 0 is most similar and 1 is most dissimilar. This doesn't work with the ScoreThresholdRetriever, which assumes that higher scores mean higher similarity.
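To illustrate the mismatch (assuming HNSWLib's default cosine space, where distances fall in [0, 1]), a distance d corresponds to a similarity of roughly 1 - d, so a minSimilarityScore of 0.9 is really asking for a distance of at most 0.1. The helper below is a hypothetical sketch, not part of the library:

```typescript
// Hypothetical conversion: HNSWLib returns a distance (0 = identical),
// while ScoreThresholdRetriever expects a similarity (1 = identical).
const distanceToSimilarity = (distance: number): number => 1 - distance;

const distances = [0.05, 0.3, 0.95];
const similarities = distances.map(distanceToSimilarity);
// A minSimilarityScore of 0.9 therefore corresponds to distance <= 0.1,
// so only the 0.05-distance result would pass the threshold.
```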

System Info

macOS, Node v21.1.0, LangChain 0.1.24

dosubot[bot] commented 7 months ago

Hey there @Jtewen! Great to see you diving into another interesting challenge with LangChain. Let's get to the bottom of this one too!

Thank you for bringing this to our attention. You're correct that the ScoreThresholdRetriever in LangChain.js currently assumes that higher scores indicate higher similarity, which is incompatible with the HNSWLib's scoring method where 0 indicates the most similar and 1 the most dissimilar.

A similar issue was previously encountered with the Prisma VectorStore, as documented in this issue. The proposed solution was to introduce a new configuration parameter that specifies the type of scoring system used. This would allow the ScoreThresholdRetriever to handle both cases, using the scoreType parameter to determine how to interpret the scores returned by the vector store.

Here's a rough idea of how the changes might look:

export const cacheRetriever = async () => ScoreThresholdRetriever
    .fromVectorStore(await cacheStore(), {
        minSimilarityScore: 0.9,
        kIncrement: 2,
        maxK: 100,
        scoreType: 'distance' // 'distance' or 'similarity'
    });

In the getRelevantDocuments method, the interpretation of the scores would then depend on the scoreType:

if (this.scoreType === 'distance') {
    // For distance metrics, lower scores mean closer matches
    filteredResults = results.filter(([, score]) => score <= this.minSimilarityScore);
} else {
    // For similarity scores, higher scores mean closer matches
    filteredResults = results.filter(([, score]) => score >= this.minSimilarityScore);
}

Please note that these changes would need to be implemented in the LangChain.js codebase. As a workaround until this issue is addressed, you could manually invert the scores returned by HNSWLib before passing them to the ScoreThresholdRetriever. This would align the scores with the ScoreThresholdRetriever's expectation that higher scores indicate higher similarity.
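A minimal sketch of that workaround, assuming the store's distances lie in [0, 1]: remap the `[document, score]` pairs returned by similaritySearchWithScore before the threshold filter runs. The `Doc`/`ScoredDoc` types and `invertScores` helper here are illustrative, not the actual LangChain classes:

```typescript
// Illustrative shapes, not the real LangChain types.
type Doc = { pageContent: string };
type ScoredDoc = [Doc, number];

// Convert distance-style scores (lower = better) into similarity-style
// scores (higher = better) so a `score >= minSimilarityScore` check
// keeps the right documents.
function invertScores(results: ScoredDoc[]): ScoredDoc[] {
  return results.map(([doc, distance]) => [doc, 1 - distance]);
}

const raw: ScoredDoc[] = [
  [{ pageContent: "a" }, 0.05], // very close match (small distance)
  [{ pageContent: "b" }, 0.4],
];
const inverted = invertScores(raw);
const kept = inverted.filter(([, score]) => score >= 0.9);
```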

I hope this helps! If you have any further questions or need additional clarification, please don't hesitate to ask.

eduardolundgren commented 4 months ago

I've encountered a similar issue with Prisma and PGVector when retrieving relevant documents based on a query and a maximum similarity distance threshold.

The core of the problem lies in the way the original ScoreThresholdRetriever implementation compares the similarity score, which doesn't align with how databases that return vector distances operate.

Original:

  async getRelevantDocuments(query) {
      let currentK = 0;
      let filteredResults = [];
      do {
          currentK += this.kIncrement;
          const results = await this.vectorStore.similaritySearchWithScore(query, currentK, this.filter);
          // COMPARISON: score >= this.minSimilarityScore
          filteredResults = results.filter(([, score]) => score >= this.minSimilarityScore);
      } while (filteredResults.length >= currentK && currentK < this.maxK);
      return filteredResults.map((documents) => documents[0]).slice(0, this.maxK);
  }

Patched version:

  async getRelevantDocuments(query: string): Promise<Document[]> {
    let currentK = 0;
    let filteredResults: [Document, number][] = [];
    do {
      currentK += this.kIncrement;
      const results = await this.vectorStore.similaritySearchWithScore(query, currentK, this.filter);
      filteredResults = results.filter(([, distance]) => {
        // COMPARISON: distance <= this.maxSimilarityDistance
        return distance <= this.maxSimilarityDistance;
      });
    } while (filteredResults.length >= currentK && currentK < this.maxK);
    return filteredResults.map((documents) => documents[0]).slice(0, this.maxK);
  }

In the original implementation, a higher score indicates more relevance, whereas, in a database that returns vector distances (see Prisma implementation), a lower distance indicates more relevance. This inversion means that the original implementation is unsuitable for such use cases.

To address this, I've implemented a custom function to handle the filtering appropriately. It keeps documents whose distance is at or below the threshold and filters out those above it, which is the correct behavior when scores are vector distances.

I wonder if I am missing something that required this patch. If anyone else is experiencing the same problem or has any insights, please let us know.
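Combining the two comments above, the comparison direction could be made configurable instead of hard-coded. This is only a sketch under illustrative names (`filterByThreshold`, `mode`, and the `Doc` type are not LangChain API):

```typescript
// Illustrative shapes, not the real LangChain types.
type Doc = { pageContent: string };
type ScoredDoc = [Doc, number];

// Generic threshold filter: `mode` selects whether the vector store
// returns similarities (keep score >= threshold) or distances
// (keep score <= threshold).
function filterByThreshold(
  results: ScoredDoc[],
  threshold: number,
  mode: "similarity" | "distance"
): ScoredDoc[] {
  return results.filter(([, score]) =>
    mode === "similarity" ? score >= threshold : score <= threshold
  );
}

const docs: ScoredDoc[] = [
  [{ pageContent: "near" }, 0.1],
  [{ pageContent: "far" }, 0.8],
];
const byDistance = filterByThreshold(docs, 0.2, "distance");
const bySimilarity = filterByThreshold(docs, 0.2, "similarity");
```

With distance-style scores the 0.1 result survives; with similarity-style scores the 0.8 result does, which is exactly the inversion the original implementation misses.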