langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

SelfQueryRetriever returns empty result for composite filter with query #21984

Open araravind opened 3 months ago

araravind commented 3 months ago


Example Code


from langchain_community.vectorstores import PGVector
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import OpenAI
import os

collection = "example_collection"
embeddings = OpenAIEmbeddings()

def load_example_docs(search_text):

    docs = [
        Document(
            page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
            metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
        ),
        Document(
            page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
            metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
        ),
        Document(
            page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
            metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
        ),
        Document(
            page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
            metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
        ),
        Document(
            page_content="Toys come alive and have a blast doing so",
            metadata={"year": 1995, "genre": "animated", "director": "Andrei Tarkovsky"},
        ),
        Document(
            page_content="Three men walk into the Zone, three men walk out of the Zone",
            metadata={
                "year": 1979,
                "director": "Andrei Tarkovsky",
                "genre": "science fiction",
                "rating": 9.9,
            },
        ),
    ]
    vectorstore = PGVector.from_documents(
        docs,
        embeddings,
        collection_name=collection
    )

    metadata_field_info = [
        AttributeInfo(
            name="genre",
            description="The genre of the movie",
            type="string or list[string]",
        ),
        AttributeInfo(
            name="year",
            description="The year the movie was released",
            type="integer",
        ),
        AttributeInfo(
            name="director",
            description="The name of the movie director",
            type="string",
        ),
        AttributeInfo(
            name="rating", description="A 1-10 rating for the movie", type="float"
        ),
    ]
    document_content_description = "Brief summary of a movie"
    llm = OpenAI(temperature=0)
    retriever = SelfQueryRetriever.from_llm(
        llm, vectorstore, document_content_description, metadata_field_info, verbose=True
    )
    invoke = retriever.invoke(search_text)
    print(invoke)

# Example 1
load_example_docs("What's a movie that's all about toys released in 1995 of genre animated and directed by Andrei Tarkovsky")
# Example 2
load_example_docs("Has Greta Gerwig directed any movies about women")
# Example 3
load_example_docs("I want to watch a movie rated higher than 8.5")
# Example 4
load_example_docs("What's a highly rated (above 8.5) science fiction film?")
# Example 5
load_example_docs("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated")

Error Message and Stack Trace (if applicable)

No response

Description

SelfQueryRetriever returns an empty result when a composite filter is combined with a query. In the code above, for example 1, the LLM returns the filter and its arguments correctly. Here is the output from the LLM:

{
  "output": {
    "query": "toys",
    "filter": {
      "operator": "and",
      "arguments": [
        {
          "comparator": "eq",
          "attribute": "year",
          "value": 1995
        },
        {
          "comparator": "eq",
          "attribute": "genre",
          "value": "animated"
        },
        {
          "comparator": "eq",
          "attribute": "director",
          "value": "Andrei Tarkovsky"
        }
      ]
    }
  }
}

However, SelfQueryRetriever returns an empty result even though Document 5 exactly matches both the filter and the query. Example 5 does not return the correct document either. The code above is taken from the LangChain documentation at https://python.langchain.com/v0.1/docs/integrations/retrievers/self_query/pgvector_self_query/; the only change is that I added "director": "Andrei Tarkovsky" to the metadata of Document 5.
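The expected result can be checked without a database or an LLM. The following is a minimal sketch, not LangChain's own filter logic: it evaluates the filter the LLM produced for example 1 against the metadata of the six example documents in plain Python. The `matches` and `matching_documents` helpers are hypothetical names introduced for this illustration, and `EXAMPLE_5_FILTER` is an assumed filter for example 5 (the issue does not show the LLM output for that query).

```python
import operator

# Metadata of the six example documents, in the order they appear in the issue.
DOCS_METADATA = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated", "director": "Andrei Tarkovsky"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "science fiction", "rating": 9.9},
]

# Filter reported in the issue for example 1 (copied from the LLM output).
EXAMPLE_1_FILTER = {
    "operator": "and",
    "arguments": [
        {"comparator": "eq", "attribute": "year", "value": 1995},
        {"comparator": "eq", "attribute": "genre", "value": "animated"},
        {"comparator": "eq", "attribute": "director", "value": "Andrei Tarkovsky"},
    ],
}

# ASSUMPTION: a plausible filter for example 5; the real LLM output is not in the issue.
EXAMPLE_5_FILTER = {
    "operator": "and",
    "arguments": [
        {"comparator": "gt", "attribute": "year", "value": 1990},
        {"comparator": "lt", "attribute": "year", "value": 2005},
        {"comparator": "eq", "attribute": "genre", "value": "animated"},
    ],
}

COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(node, metadata):
    """Return True if the metadata dict satisfies the filter node (hypothetical helper)."""
    if "comparator" in node:
        # Leaf node: a missing attribute never matches.
        if node["attribute"] not in metadata:
            return False
        compare = COMPARATORS[node["comparator"]]
        return compare(metadata[node["attribute"]], node["value"])
    # Composite node: evaluate every argument, then combine the results.
    results = [matches(arg, metadata) for arg in node["arguments"]]
    if node["operator"] == "and":
        return all(results)
    if node["operator"] == "or":
        return any(results)
    if node["operator"] == "not":
        return not results[0]
    raise ValueError(f"unsupported operator: {node['operator']}")


def matching_documents(filter_node, docs_metadata):
    """Return the 1-based positions of the documents whose metadata matches."""
    return [
        position
        for position, metadata in enumerate(docs_metadata, start=1)
        if matches(filter_node, metadata)
    ]


EXAMPLE_1_MATCHES = matching_documents(EXAMPLE_1_FILTER, DOCS_METADATA)
EXAMPLE_5_MATCHES = matching_documents(EXAMPLE_5_FILTER, DOCS_METADATA)
print(EXAMPLE_1_MATCHES)  # [5]
print(EXAMPLE_5_MATCHES)  # [5]
```

Both filters select only Document 5 when applied to the metadata directly. This suggests the empty result arises somewhere between the structured query and the vector store lookup, not in the data or in the filter the LLM generated.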

System Info

langchain==0.1.20
langchain-community==0.0.38
langchain-core==0.1.52
langchain-openai==0.1.6

Platform: Ubuntu

xanjay commented 3 weeks ago

@araravind It looks like some of this functionality is deprecated in v0.1. Have you tried the retriever in v0.2? https://python.langchain.com/v0.2/docs/integrations/vectorstores/pgvector/#query-by-turning-into-retriever