Issues with SelfQueryRetriever and the "AND" operator failing in queries that search for multiple metadata flags

langchain-ai / langchain


Issues with SelfQueryRetriever and the "AND" operator failing in queries that search for multiple metadata flags #15919

Closed XariZaru closed 5 months ago

XariZaru commented 10 months ago

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.

Example Code

# Import paths assume LangChain 0.1.x.
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.chat_models import ChatOpenAI
from langchain_community.vectorstores import Pinecone

# Metadata fields the LLM is allowed to filter on.
metadata_field_info = [
    AttributeInfo(name="source", description="The document this chunk is from.", type="string"),
    AttributeInfo(name="origin", description="The origin the document came from. Comes from either scraped websites like TheKinection.org, Kinecta.org or database files like Bancworks. Bancworks is the higher priority.", type="string"),
    AttributeInfo(name="date_day", description="The day the document was uploaded.", type="string"),
    AttributeInfo(name="date_uploaded", description="The month year the document is current to.", type="string"),
    AttributeInfo(name="date_month", description="The month the document was uploaded.", type="string"),
    AttributeInfo(name="date_month_name", description="The month name the document was uploaded.", type="string"),
    AttributeInfo(name="date_year_long", description="The full year the document was uploaded.", type="string"),
    AttributeInfo(name="date_year_short", description="The short year the document was uploaded.", type="string"),
]

llm = ChatOpenAI(temperature=0)
vectorstore = Pinecone.from_existing_index(index_name="test", embedding=get_embedding_function())
# print("Load existing vector store")

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    "Information about when the document was created and where it was grabbed from.",
    metadata_field_info,
)

question = "Give the minimum opening deposits for each accounts for the rate sheets in January"
retriever.get_relevant_documents(question)

Description

When I ask to fetch relevant documents with the following query:

"Give the minimum opening deposits for each accounts for the rate sheets in January"

There is no problem. However, if I make this query a little more robust...

"Give the minimum opening deposits for each accounts for the rate sheets in January 2023"

I get a CNAME "and" error. This happens with both Pinecone and ChromaDB. Either something is wrong with how the query translator operates, or I am missing a crucial step. We should be able to filter on multiple metadata fields at once.

System Info

Python 3.11, LangChain 0.1.0, Chroma 0.4.22, Pinecone 2.2.4

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[ ] Document Loaders
[X] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

dosubot[bot] commented 10 months ago

🤖

Hello @XariZaru! Good to see you back in the LangChain repository. Hope your day is going well!

Based on the information you've provided, it seems like the issue might be related to how the SelfQueryRetriever is handling queries with multiple metadata flags. The "AND" operator in the SelfQueryRetriever class is used to combine multiple conditions in a query. When used, all conditions must be met for a record to be returned.

In the context of the ElasticsearchTranslator, the "AND" operator is translated into a "must" clause within a "bool" query in Elasticsearch. This means that all conditions specified in the "must" clause must be satisfied for a document to match the query.

Here is a relevant code snippet from the LangChain repository:

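# Import paths assume LangChain 0.1.x.
from langchain.chains.query_constructor.ir import (
    Comparator,
    Comparison,
    Operation,
    Operator,
)
from langchain.retrievers.self_query.elasticsearch import ElasticsearchTranslator

DEFAULT_TRANSLATOR = ElasticsearchTranslator()


def test_visit_operation() -> None:
    # An AND over two equality comparisons.
    op = Operation(
        operator=Operator.AND,
        arguments=[
            Comparison(comparator=Comparator.EQ, attribute="foo", value=2),
            Comparison(comparator=Comparator.EQ, attribute="bar", value="baz"),
        ],
    )
    # Both comparisons end up inside a single "must" clause.
    expected = {
        "bool": {
            "must": [
                {"term": {"metadata.foo": 2}},
                {"term": {"metadata.bar.keyword": "baz"}},
            ]
        }
    }
    actual = DEFAULT_TRANSLATOR.visit_operation(op)
    assert expected == actual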

In this test case, an Operation object is created with the "AND" operator and two Comparison objects as arguments. The expected output is a dictionary representing an Elasticsearch query where both conditions (metadata.foo equals 2 and metadata.bar.keyword equals "baz") must be met. The actual output from the visit_operation method of the ElasticsearchTranslator is then compared to this expected output. You can find this in the libs/langchain/tests/unit_tests/retrievers/self_query/test_elasticsearch.py file.
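
To make the mapping in that test concrete, here is a minimal standard-library sketch of the same translation. It is a simplified stand-in, not LangChain's ElasticsearchTranslator: the `Comparison` and `Operation` classes and the `translate` function below are hypothetical, handle only equality comparisons and the "and" operator, and assume (as the test output suggests) that string values are matched on a `.keyword` sub-field.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class Comparison:
    """Hypothetical stand-in for an equality comparison on one metadata field."""
    attribute: str
    value: Any


@dataclass
class Operation:
    """Hypothetical stand-in for a logical operation over several statements."""
    operator: str
    arguments: List[Union["Comparison", "Operation"]]


# Only "and" is covered here; it maps to an Elasticsearch "must" clause.
_CLAUSES = {"and": "must"}


def translate(node: Union[Comparison, Operation]) -> Dict[str, Any]:
    """Turn a comparison or an "and" operation into an Elasticsearch-style dict."""
    if isinstance(node, Comparison):
        field = f"metadata.{node.attribute}"
        if isinstance(node.value, str):
            # Assumption taken from the test above: strings use the keyword sub-field.
            field += ".keyword"
        return {"term": {field: node.value}}
    if node.operator not in _CLAUSES:
        raise NotImplementedError(f"operator {node.operator!r} is not covered by this sketch")
    return {"bool": {_CLAUSES[node.operator]: [translate(arg) for arg in node.arguments]}}


query = translate(
    Operation("and", [Comparison("foo", 2), Comparison("bar", "baz")])
)
print(query)
```

Every argument of the "and" lands in the same "must" list, so a document has to satisfy all of them to match.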

If you're still encountering the issue, could you please provide the exact error message you're receiving? This will help in diagnosing the problem more accurately.

Sources

libs/langchain/tests/unit_tests/retrievers/self_query/test_elasticsearch.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

sweetkane commented 10 months ago

When the LLM generates the query, it seems to misunderstand how the "and" operator is used. The prompt for query generation states the following:

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

but the llm returns the following query:

{ 
    "query": "minimum opening deposits",
    "filter": "eq(\"date_month_name\", \"January\") and eq(\"date_year_long\", \"2023\")"
}

I think the correct query would be:

{ 
    "query": "minimum opening deposits",
    "filter": "and(eq(\"date_month_name\", \"January\"), eq(\"date_year_long\", \"2023\"))"
}

I think this boils down to the LLM getting confused. The solution might be to update the prompt with more examples.
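
To illustrate the difference between the two forms, here is a rough sketch of a repair step that rewrites a flat infix filter into the prefix form the prompt asks for. The helper `infix_to_prefix` is hypothetical and not part of LangChain. It assumes double-quoted string values and a single operator type at the top level, and it refuses mixed and/or expressions rather than guessing their grouping.

```python
def infix_to_prefix(filter_str: str) -> str:
    """Rewrite 'a and b' into 'and(a, b)'; leave other filters unchanged."""
    operands, operators = [], []
    depth, in_quote, start, i = 0, False, 0, 0
    while i < len(filter_str):
        ch = filter_str[i]
        if in_quote:
            if ch == "\\":
                i += 1  # skip the escaped character inside a string value
            elif ch == '"':
                in_quote = False
        elif ch == '"':
            in_quote = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0:
            # Only split on operators outside parentheses and outside quotes.
            for op in ("and", "or"):
                token = f" {op} "
                if filter_str.startswith(token, i):
                    operands.append(filter_str[start:i].strip())
                    operators.append(op)
                    i += len(token) - 1
                    start = i + 1
                    break
        i += 1
    operands.append(filter_str[start:].strip())
    if not operators:
        return filter_str  # already prefix form, or a single comparison
    if len(set(operators)) > 1:
        raise ValueError("mixed and/or at the top level needs explicit grouping")
    return f"{operators[0]}({', '.join(operands)})"


malformed = 'eq("date_month_name", "January") and eq("date_year_long", "2023")'
print(infix_to_prefix(malformed))
# and(eq("date_month_name", "January"), eq("date_year_long", "2023"))
```

A filter that is already in prefix form passes through untouched, so the helper is safe to apply to every generated filter.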

XariZaru commented 10 months ago

Thank you for taking time out to elaborate on the issue. I'm still acclimating to Langchain and LLMs in particular. I do have a prompt template that I am using right now. Would your suggestion change if I aim to use a ConversationalRetrievalChain?

"""
    ### Instruction: You're an assistant at KFCU who knows the following information:
    ### {context}

    If you don't know the answer, then say you don't know and refer the user to the respective department for extra information.        
    Absolutely do not mention you are an AI language model. Use only the chat history and the following information. 

    ### {chat_history}

    ### Input: {question}
    ### Response:
    """

I looked through the documentation and saw some examples as such:

example_template = """Here's an example of an interaction:

Q: {example_q}
A: {example_a}"""
example_prompt = PromptTemplate.from_template(example_template)

If I have many styles of prompts how would that work given my current prompt?

sweetkane commented 10 months ago

Sorry for being unclear. During the get_relevant_documents call, the SelfQueryRetriever prompts the llm to create a structured query to get documents using the metadata field info. The llm returns a malformed query. That's what I'm referring to in the previous comment.

XariZaru commented 10 months ago

@sweetkane Oh, I see. Are you talking about the LCEL section of the following page? Self Query

Edit: I found it! Also, how would you recommend writing an example where I ask the SelfQueryRetriever to fetch the latest uploaded version of a document (given that the upload date is in the metadata)?

I have done the following:

# Import paths assume LangChain 0.1.x.
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
    load_query_constructor_runnable,
)
from langchain_community.chat_models import ChatOpenAI
from langchain_community.vectorstores import Pinecone

examples = [
    (
        "What are the minimum opening deposits for a rate sheet current for January 2023?",
        {
            "query": "rate sheet, minimum opening deposit",
            "filter": 'and(eq("date_month_name", "January"), eq("date_year_long", 2023))',
        },
    ),
    (
        "What is the minimum opening deposit for an IRA Money Market account for March 2021?",
        {
            "query": "minimum opening deposit, IRA Money Market account",
            "filter": 'and(eq("date_month_name", "March"), eq("date_year_long", 2021))',
        },
    ),
    # (
    #     "What is today's ",
    #     {
    #         "query": "minimum opening deposit, IRA Money Market account",
    #         "filter": 'and(eq("date_month_name", "March"), eq("date_year_long", 2021))',
    #     },
    # ),
]

llm = ChatOpenAI(temperature=0)

doc_contents = "Information about various documents, the date they are up to date with and where they were sourced from."

chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    doc_contents,
    metadata_field_info,
    examples=examples,
    # fix_invalid=True,
)

print("Chain invoke")
print(chain.invoke({"query": "I want to see the rate sheet for January 2023"}))

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains import RetrievalQA

question = "Give the minimum opening deposits for a rate sheet in January 2023"

vectorstore = Pinecone.from_existing_index(index_name="test", embedding=get_embedding_function())
# print("Load existing vector store")

retriever = SelfQueryRetriever(
    query_constructor=chain, vectorstore=vectorstore, verbose=True
)

retriever.get_relevant_documents(question)

The retriever now returns no documents. It doesn't throw an error at least. However, based on metadata, I think it should be able to grab the documents. Here is an example document.

(Document(page_content='File Name: 28363 Rate Sheet Update 040123.pdf, Document Name: 2023 Rate Sheets, Date: 03/31/23\n\n28363-03/28/2023\n\nFEDERALLY INSURED BY NCUA\n\nEffective Date\n\nApril 1, 2023\n\nConsumer Dividend Rate Sheet\n\n(Rates are subject to change without notice.)\n\nACCOUNT TYPE\n\nMINIMUM OPENING\n\nDEPOSIT\n\nMINIMUM BALANCE\n\nTO EARN STATED APY\n\nDIVIDEND\n\nRATE\n\nAPY1\n\nRegular Share/Savings Account\n\n$5\n\n$5\n\n0.05%\n\n0.05%', metadata={'date_day': '31', 'date_month': '3', 'date_month_name': 'March', 'date_uploaded': datetime.date(2023, 3, 12), 'date_year_long': '2023', 'date_year_short': '23', 'doc_id': 'fd966b38-ebee-4da0-919d-2f3fafa28910', 'origin': 'bancworks', 'source': 'storage/Bancworks/28363 Rate Sheet Update 040123.pdf.txt'}),
  0.859830618),
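
One thing that may be worth checking here, offered as a guess rather than a confirmed diagnosis: the example document stores 'date_year_long' as the string '2023' and 'date_month_name' as 'March', while the few-shot examples above compare the year against the integer 2023 and the question asks for January. The sketch below assumes the vector store compares filter values strictly by type and value. The `matches` helper is hypothetical and only mimics an "and" of equality checks in plain Python.

```python
import datetime

# Metadata copied from the example document above (doc_id and source omitted).
metadata = {
    "date_day": "31",
    "date_month": "3",
    "date_month_name": "March",
    "date_uploaded": datetime.date(2023, 3, 12),
    "date_year_long": "2023",
    "date_year_short": "23",
    "origin": "bancworks",
}


def matches(doc_metadata: dict, conditions: dict) -> bool:
    """Return True only if every field equals the requested value exactly."""
    return all(doc_metadata.get(field) == value for field, value in conditions.items())


january_int_year = {"date_month_name": "January", "date_year_long": 2023}
march_int_year = {"date_month_name": "March", "date_year_long": 2023}
march_str_year = {"date_month_name": "March", "date_year_long": "2023"}

print(matches(metadata, january_int_year))  # False: month differs, year is an int
print(matches(metadata, march_int_year))    # False: 2023 != "2023"
print(matches(metadata, march_str_year))    # True
```

If the store behaves this way, a filter for January with an integer year would exclude this document even though the query itself is well formed.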

@dosu-bot

dosubot[bot] commented 10 months ago

🤖

Thank you for providing such a detailed description of the issue you're facing. It seems like the problem lies in the way the SelfQueryRetriever is handling complex queries with multiple metadata flags.

The SelfQueryRetriever in LangChain uses a combination of a query constructor and a structured query translator to handle complex queries. The query constructor generates a structured query from the input query string, which includes the metadata flags. This structured query is then passed to the structured query translator, which translates it into parameters that can be used to search the vector store.

The error you're encountering might be due to the way the "AND" operator is being handled in the structured query translator. If the translator is not correctly interpreting the "AND" operator, it could lead to the CNAME error you're seeing.

Unfortunately, without access to the source code of the SelfQueryRetriever and the structured query translator, it's difficult to pinpoint the exact cause of the issue. However, you might want to check the implementation of the visit_structured_query method in the structured query translator. This method is responsible for translating the structured query into a new query string and a dictionary of keyword arguments. If the "AND" operator is not being correctly translated, it could lead to the error you're seeing.

Here is the relevant code from the SelfQueryRetriever:

def _prepare_query(
    self, query: str, structured_query: StructuredQuery
) -> Tuple[str, Dict[str, Any]]:
    new_query, new_kwargs = self.structured_query_translator.visit_structured_query(
        structured_query
    )
    if structured_query.limit is not None:
        new_kwargs["k"] = structured_query.limit
    if self.use_original_query:
        new_query = query
    search_kwargs = {**self.search_kwargs, **new_kwargs}
    return new_query, search_kwargs

This method prepares the query for execution by calling the visit_structured_query method of the structured query translator. If the "AND" operator is not being correctly handled in this method, it could lead to the error you're seeing.
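
For readers who want to see the merging behaviour in isolation, here is a standalone sketch that mirrors the method quoted above. It is not LangChain code: `StubTranslator` and `prepare_query` are hypothetical, and the structured query is represented as a plain dict rather than a StructuredQuery object. Note that the method receives a structured query that has already been parsed.

```python
from typing import Any, Dict, Tuple


class StubTranslator:
    """Hypothetical translator that passes the filter through unchanged."""

    def visit_structured_query(
        self, structured_query: Dict[str, Any]
    ) -> Tuple[str, Dict[str, Any]]:
        kwargs: Dict[str, Any] = {}
        if structured_query.get("filter") is not None:
            kwargs["filter"] = structured_query["filter"]
        return structured_query["query"], kwargs


def prepare_query(
    query: str,
    structured_query: Dict[str, Any],
    translator: StubTranslator,
    search_kwargs: Dict[str, Any],
    use_original_query: bool = False,
) -> Tuple[str, Dict[str, Any]]:
    """Mirror of the quoted method, with state passed in as arguments."""
    new_query, new_kwargs = translator.visit_structured_query(structured_query)
    if structured_query.get("limit") is not None:
        new_kwargs["k"] = structured_query["limit"]
    if use_original_query:
        new_query = query
    # Values produced from the structured query override the defaults.
    return new_query, {**search_kwargs, **new_kwargs}


new_query, search_kwargs = prepare_query(
    "Give the minimum opening deposits for the rate sheets in January 2023",
    {"query": "minimum opening deposits", "filter": {"date_month_name": "January"}, "limit": 2},
    StubTranslator(),
    {"k": 4},
)
print(new_query)
print(search_kwargs)
```

In this run the limit from the structured query replaces the default "k", and the rewritten query text is used because use_original_query is left at False.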

I hope this helps! If you have any more information or if there's anything else I can do to assist, please let me know.

Sources

libs/langchain/langchain/retrievers/self_query/base.py


XariZaru commented 10 months ago

I have no idea what happened. I stopped my work on Friday and ran the code today without touching anything and it returned documents. Previously on Friday, running the SelfQueryRetriever yielded no documents even though the metadata filtering returned something.

znwilkins commented 9 months ago

Further to XariZaru's post, you can pass the examples directly to the class method which helps mitigate the issue but doesn't solve it entirely:

sq_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    document_contents=document_content_description,
    metadata_field_info=METADATA_FIELD_INFO,
    chain_kwargs={
        "examples": [
            (
                "According to ABC_123, what is foobar?",
                {
                    "query": "foobar",
                    "filter": 'eq("id", "ABC_123")',
                },
            ),
        ]
    },
)
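
Since the failing case in this thread involves two metadata fields, it may help to include few-shot examples whose filters use the prefix "and" form. The sketch below builds such examples with a small helper. `and_filter` is hypothetical and not a LangChain function, and the field names and questions are taken from earlier in the thread. Year values are written as strings here on the assumption that the stored metadata uses strings.

```python
import json


def and_filter(**conditions) -> str:
    """Build 'eq(...)' for one field, or 'and(eq(...), eq(...))' for several."""
    comparisons = [
        f"eq({json.dumps(name)}, {json.dumps(value)})"
        for name, value in conditions.items()
    ]
    if len(comparisons) == 1:
        return comparisons[0]
    return "and(" + ", ".join(comparisons) + ")"


examples = [
    (
        "What are the minimum opening deposits for a rate sheet current for January 2023?",
        {
            "query": "rate sheet, minimum opening deposit",
            "filter": and_filter(date_month_name="January", date_year_long="2023"),
        },
    ),
    (
        "What is the minimum opening deposit for an IRA Money Market account for March 2021?",
        {
            "query": "minimum opening deposit, IRA Money Market account",
            "filter": and_filter(date_month_name="March", date_year_long="2021"),
        },
    ),
]

print(examples[0][1]["filter"])
# and(eq("date_month_name", "January"), eq("date_year_long", "2023"))
```

The resulting list has the same shape as the "examples" value in the snippet above, so it could be passed through chain_kwargs in the same way.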

newssnap commented 3 months ago

@XariZaru Did it get resolved?

I'm still facing issues, even though I have passed examples that use the "and" operator.