langchain-ai / langchainjs

The SelfQueryRetriever for the SupabaseVectorStore does not seem to find anything in my pgvector table #5876

Closed HenkBourgonje closed 3 days ago

HenkBourgonje commented 3 months ago

Example Code

        const embeddings = new OpenAIEmbeddings({
            openAIApiKey: CHATGPT_API_KEY,
            maxConcurrency: 5
        })

        // Defining openai instance
        const llm = new OpenAI({
            modelName: "gpt-4o",
            temperature: 0.7,
            openAIApiKey: CHATGPT_API_KEY
        });

        // defining my vector store
        const vectorStore = await SupabaseVectorStore.fromExistingIndex(embeddings, {
            client: adminSupabase,
            tableName: 'documents', // this is correct
            queryName: "match_documents", // this exists and works, because its also used successfully elsewhere in the app
        })

        // Defining my attribute info
        export const storeAttributeInfo: AttributeInfo[] = [
            {
                name: "product_code",
                description: "The unique code of the product.",
                type: "string",
            },
            {
                name: "name",
                description: "The name of the product.",
                type: "string",
            },
            {
                name: "price",
                description: "The price of the product.",
                type: "string",
            },
            {
                name: "stock",
                description: "The stock level of the product.",
                type: "number",
            },
        ];

        // The retriever itself, including a filter by chatbot_id because this runs inside a platform containing multiple chatbots.
        const selfQueryRetriever = SelfQueryRetriever.fromLLM({
            llm,
            vectorStore,
            documentContents: "Brief description of the product.",
            attributeInfo: storeAttributeInfo,
            structuredQueryTranslator: new SupabaseTranslator(),
            verbose: true,
            searchParams: {
                k: 8,
                filter: (rpc: SupabaseFilter) => rpc.filter("metadata->>chatbot_id", "eq", chatbotId),
                mergeFiltersOperator: "and",
            }
        });

        // This is always []
        const documents = await selfQueryRetriever.invoke(searchQuery)

Error Message and Stack Trace (if applicable)

No response

Description

I have a RAG chatbot that should answer questions about the inventory of my client's webshop. Self-query retrieval seemed like a good fit for this use case, because questions about these products will reference metadata. I am trying to implement it using the SupabaseVectorStore, but it does not work as expected. For example:

This is what the metadata column of a product looks like:

{
  "loc": {
    "lines": {
      "to": 1,
      "from": 1
    }
  },
  "name": "Bolt 16x8mm",
  "price": "0.80 EUR",
  "stock": 1017,
  "chatbot_id": 134,
  "product_code": "123456"
}

As you can see, these values are defined in the attributeInfo array of my Self Query implementation and I expect to be able to ask my chatbot questions about it. When I ask the following question for example: "Of what product do you have more than 1000 in stock?", it correctly creates the search filter stating that stock should be greater than 1000 (as seen in the logging since verbose is set to true).

How can it be that, even though the filters are created correctly and the documents exist in the database, the retriever never returns any documents?
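
One way to narrow this down is to confirm, outside the database, that the sample row really satisfies both conditions. The sketch below is a plain TypeScript model, not LangChain or Supabase code: `ProductMetadata`, `matchesFilters`, and the `minStock` parameter are hypothetical names introduced for illustration. It compares `chatbot_id` as text, since the Postgres `->>` operator used in the default filter returns text.

```typescript
// Hypothetical shape of the metadata column shown above (the `loc` field is omitted).
interface ProductMetadata {
  name: string;
  price: string;
  stock: number;
  chatbot_id: number;
  product_code: string;
}

// The sample row from the question.
const sample: ProductMetadata = {
  name: "Bolt 16x8mm",
  price: "0.80 EUR",
  stock: 1017,
  chatbot_id: 134,
  product_code: "123456",
};

// Local stand-in for the two conditions combined with "and":
// the default chatbot_id filter (compared as text, like `metadata->>chatbot_id`)
// and the generated `stock > minStock` filter.
function matchesFilters(
  metadata: ProductMetadata,
  chatbotId: number | string,
  minStock: number
): boolean {
  const chatbotMatches = String(metadata.chatbot_id) === String(chatbotId);
  const stockMatches = metadata.stock > minStock;
  return chatbotMatches && stockMatches;
}

console.log(matchesFilters(sample, 134, 1000)); // prints true
```

If this check passes locally while the retriever still returns nothing, the mismatch is more likely in how the filters reach the `match_documents` call than in the stored data.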

System Info

linux langchain@0.2.6 @langchain/community@0.2.13 @langchain/core@0.2.9 @langchain/openai@0.2.0 node@v18.16.0

dosubot[bot] commented 3 months ago

The SelfQueryRetriever for the SupabaseVectorStore may be returning no documents, even though the filters are created correctly and the documents exist in the database, because of the mergeFilters function in the SupabaseTranslator class. Specifically, if both defaultFilter and generatedFilter are empty, the function returns undefined, which means no filter is applied, and thus no documents are retrieved.

Here is the relevant part of the code:

mergeFilters(
  defaultFilter: SupabaseFilterRPCCall | SupabaseMetadata | undefined,
  generatedFilter: SupabaseFilterRPCCall | undefined,
  mergeType = "and"
): SupabaseFilterRPCCall | SupabaseMetadata | undefined {
  if (isFilterEmpty(defaultFilter) && isFilterEmpty(generatedFilter)) {
    return undefined;
  }
  if (isFilterEmpty(defaultFilter) || mergeType === "replace") {
    if (isFilterEmpty(generatedFilter)) {
      return undefined;
    }
    return generatedFilter;
  }
  if (isFilterEmpty(generatedFilter)) {
    if (mergeType === "and") {
      return undefined;
    }
    return defaultFilter;
  }

  let myDefaultFilter = defaultFilter;
  if (isObject(defaultFilter)) {
    const { filter } = this.visitStructuredQuery(
      convertObjectFilterToStructuredQuery(defaultFilter)
    );
    // just in case the built filter is empty somehow
    if (isFilterEmpty(filter)) {
      if (isFilterEmpty(generatedFilter)) {
        return undefined;
      }
      return generatedFilter;
    }
    myDefaultFilter = filter;
  }
  // After this point, myDefaultFilter will always be SupabaseFilterRPCCall
  if (mergeType === "or") {
    return (rpc) => {
      const defaultFlattenedParams = ProxyParamsDuplicator.getFlattenedParams(
        rpc,
        myDefaultFilter as SupabaseFilterRPCCall
      );
      const generatedFlattenedParams =
        ProxyParamsDuplicator.getFlattenedParams(rpc, generatedFilter);
      return rpc.or(`${defaultFlattenedParams},${generatedFlattenedParams}`);
    };
  } else if (mergeType === "and") {
    return (rpc) =>
      generatedFilter((myDefaultFilter as SupabaseFilterRPCCall)(rpc));
  } else {
    throw new Error("Unknown merge type");
  }
}
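
The branches of the function above can be summarized as a small decision table. The sketch below is a simplified model of the quoted code, not library code: `mergeOutcome` and its string results are hypothetical, it takes booleans in place of real filters, and it skips the branch that converts an object filter into an RPC call.

```typescript
// Hypothetical labels for what the quoted mergeFilters would return.
type Outcome =
  | "undefined"
  | "generated"
  | "default"
  | "and(default, generated)"
  | "or(default, generated)";

// Mirrors the order of checks in the quoted function, using emptiness flags
// instead of real filters. Assumes the default filter is already a function.
function mergeOutcome(
  defaultEmpty: boolean,
  generatedEmpty: boolean,
  mergeType: string = "and"
): Outcome {
  if (defaultEmpty && generatedEmpty) {
    return "undefined";
  }
  if (defaultEmpty || mergeType === "replace") {
    return generatedEmpty ? "undefined" : "generated";
  }
  if (generatedEmpty) {
    return mergeType === "and" ? "undefined" : "default";
  }
  if (mergeType === "or") {
    return "or(default, generated)";
  }
  if (mergeType === "and") {
    return "and(default, generated)";
  }
  throw new Error("Unknown merge type");
}

// The setup from the question: both filters present, merged with "and".
console.log(mergeOutcome(false, false, "and")); // prints and(default, generated)
// Default filter present, but the LLM generated no filter.
console.log(mergeOutcome(false, true, "and")); // prints undefined
```

Under this model, the configuration in the question (a non-empty `chatbot_id` default filter, a non-empty generated `stock` filter, and `mergeFiltersOperator: "and"`) reaches the final "and" branch, where both filters are chained. The second call shows that, per the quoted code, an empty generated filter combined with "and" yields `undefined`, so the default filter is dropped as well.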

Additionally, ensure that your environment variables for Supabase are correctly set, as missing these can also cause issues. Here is a test case that demonstrates the correct setup and usage:

test("Supabase Store Self Query Retriever Test", async () => {
  const docs = [
    new Document({
      pageContent:
        "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
      metadata: { year: 1993, rating: 7.7, genre: "science fiction" },
    }),
    new Document({
      pageContent:
        "Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
      metadata: { year: 2010, director: "Christopher Nolan", rating: 8.2 },
    }),
    new Document({
      pageContent:
        "A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
      metadata: { year: 2006, director: "Satoshi Kon", rating: 8.6 },
    }),
    new Document({
      pageContent:
        "A bunch of normal-sized women are supremely wholesome and some men pine after them",
      metadata: {
        year: 2019,
        director: "Greta Gerwig",
        rating: 8.3,
        genre: "drama",
      },
    }),
    new Document({
      pageContent: "Toys come alive and have a blast doing so",
      metadata: { year: 1995, genre: "animated" },
    }),
    new Document({
      pageContent:
        "Three men walk into the Zone, three men walk out of the Zone",
      metadata: {
        year: 1979,
        director: "Andrei Tarkovsky",
        genre: "science fiction",
        rating: 9.9,
      },
    }),
    new Document({
      pageContent: "10x the previous gecs",
      metadata: {
        year: 2023,
        title: "10000 gecs",
        artist: "100 gecs",
        rating: 9.9,
        type: "album",
      },
    }),
  ];

  const attributeInfo: AttributeInfo[] = [
    {
      name: "genre",
      description: "The genre of the movie",
      type: "string or array of strings",
    },
    {
      name: "year",
      description: "The year the movie was released",
      type: "number",
    },
    {
      name: "director",
      description: "The director of the movie",
      type: "string",
    },
    {
      name: "rating",
      description: "The rating of the movie (1-10)",
      type: "number",
    },
    {
      name: "length",
      description: "The length of the movie in minutes",
      type: "number",
    },
  ];

  if (
    !process.env.SUPABASE_VECTOR_STORE_URL ||
    !process.env.SUPABASE_VECTOR_STORE_PRIVATE_KEY
  ) {
    throw new Error(
      "Supabase URL or private key not set. Please set it in the .env file"
    );
  }

  const embeddings = new OpenAIEmbeddings();
  const llm = new OpenAI();
  const documentContents = "Brief summary of a movie";
  const client = createClient(
    process.env.SUPABASE_VECTOR_STORE_URL,
    process.env.SUPABASE_VECTOR_STORE_PRIVATE_KEY
  );
  const vectorStore = new SupabaseVectorStore(embeddings, { client });
  // idempotency
  const opts = { ids: docs.map((_, idx) => idx) };
  await vectorStore.addDocuments(docs, opts);
  const selfQueryRetriever = SelfQueryRetriever.fromLLM({
    llm,
    vectorStore,
    documentContents,
    attributeInfo,
    structuredQueryTranslator: new SupabaseTranslator(),
    searchParams: {
      filter: (rpc: SupabaseFilter) => rpc.filter("metadata->>type", "eq", "movie"),
      mergeFiltersOperator: "and",
    }
  });

  const query1 = await selfQueryRetriever.getRelevantDocuments(
    "Which movies are less than 90 minutes?"
  );
  console.log(query1);
  expect(query1.length).toEqual(0);
  const query2 = await selfQueryRetriever.getRelevantDocuments(
    "Which movies are rated higher than 8.5?"
  );
  console.log(query2);
  expect(query2.length).toEqual(3);
  const query3 = await selfQueryRetriever.getRelevantDocuments(
    "Which movies are directed by Greta Gerwig?"
  );
  console.log(query3);
  expect(query3.length).toEqual(1);
});

Ensure that the filters are not empty and are correctly merged to retrieve the documents. Additionally, verify that your environment variables are correctly set and that the documents are properly added to the vector store. You can also initialize the retriever with default search parameters that apply in addition to the generated query:

const selfQueryRetriever = SelfQueryRetriever.fromLLM({
  llm,
  vectorStore,
  documentContents,
  attributeInfo,
  structuredQueryTranslator: new SupabaseTranslator(),
  searchParams: {
    filter: (rpc: SupabaseFilter) => rpc.filter("metadata->>type", "eq", "movie"),
    mergeFiltersOperator: "and",
  }
});

See the official docs for more on how to construct metadata filters [1][2][3].
