langchain-ai / langchain


SelfQueryRetriever, ValueError: Expected where operand value to be a str, int, float, or list of those type #15696

Closed XariZaru closed 6 months ago

XariZaru commented 9 months ago

System Info

Chroma 0.4.22, LangChain 0.0.354

Who can help?

@agola11

Reproduction

  1. Create a SelfQueryRetriever
  2. Create an AttributeInfo metadata list in preparation for filtering based on metadata.
    self_query_retriever = SelfQueryRetriever.from_llm(
        llm,
        vectorstore,
        "Information about when document was published and where it originated from",
        metadata_field_info
    )

    # retriever = MergerRetriever(retrievers=[parent_retriever, self_query_retriever])

    retriever = self_query_retriever

    template = """
    ### Instruction: You're an assistant who knows the following information:
    ### {context}

    If you don't know the answer, then say you don't know and refer the user to the respective department for extra information.        
    Absolutely do not mention you are an AI language model. Use only the chat history and the following information. 

    ### {chat_history}

    ### Input: {question}
    ### Response:
    """.strip()

    prompt = PromptTemplate(input_variables=["context", "chat_history", "question"], template=template)

    chain = ConversationalRetrievalChain.from_llm(
        llm,
        chain_type="stuff",
        retriever=retriever,
        combine_docs_chain_kwargs={"prompt": prompt},  # , "metadata_weights": metadata_weights
        return_source_documents=True,
        verbose=False,
        rephrase_question=True,
        max_tokens_limit=16000,
        response_if_no_docs_found="""I'm sorry, but I was not able to find the answer to your question based on the information I know. You may have to reach out to the respective internal department for more details regarding your inquiry."""
    )

    return chain

def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for making completions with our knowledge base.

    When requesting predictions from the deployment, pass a dictionary
    with the following keys:
    - 'question' the question to be passed to the retrieval chain
    - 'chat_history' (optional) a list of two-element lists corresponding to
      preceding dialogue between the Human and AI, respectively

    datarobot-user-models (DRUM) handles loading the model and calling
    this function with the appropriate parameters.

    Returns:
    --------
    rv : str
        Json dictionary with keys:
            - 'question' user's original question
            - 'chat_history' chat history that was provided with the original question
            - 'answer' the generated answer to the question
            - 'references' list of references that were used to generate the answer
            - 'error' - error message if exception in handling request
    """
    import json

    try:
        chain = model
        data_dict = json.loads(data)

        if 'chat_history' in data_dict:
            chat_history = [(human, ai,) for human, ai in data_dict['chat_history']]
        else:
            chat_history = []  # model.chat_history
        rv = chain(
                inputs={
                    'question': data_dict['question'],
                    'chat_history': chat_history,
                },
             )

        source_docs = rv.pop('source_documents')
        rv['references'] = [doc.metadata['source'] for doc in source_docs]

        if len(source_docs) > 0:
            rv["top_reference_text"] = [doc.page_content for doc in source_docs]
        else:
            rv["top_reference_text"] = ""
    except Exception as e:
        rv = {'error': f"{e.__class__.__name__}: {str(e)}"}
    return json.dumps(rv)

model = load_model(".")

I asked the following question:

questions = ["What is the minimum opening deposit for each account as of January 2023?"]

os.environ["TOKENIZERS_PARALLELISM"] = "false"

for question in questions:
    rv = score_unstructured(model, json.dumps(
            {
                "question": question
                # "chat_history": []
            }
        ), 
        None)
    print(rv)
    print(question.upper())
    print(json.loads(rv)["answer"])
    print(json.loads(rv))
    print("------------------------------------------------")

The issue I got was ValueError: Expected where operand value to be a str, int, float, or list of those type, got {'date': '2023-01-01', 'type': 'date'}

It looks like the SelfQueryRetriever converted the "January 2023" in my question into a date object, and Chroma's where filter rejects that object. I'm not sure how to resolve this on my end.

Expected behavior

Query with a date and receive an answer from the SelfQueryRetriever.
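
For reference, Chroma's where clause only accepts strings, ints, floats, or lists of those, so the {'date': '2023-01-01', 'type': 'date'} object emitted by the query constructor is rejected outright. A minimal workaround sketch (not the fix applied in this thread, and using hypothetical field names) is to store dates as integer Unix timestamps in the metadata and describe the attribute as an integer, so the generated filter only ever compares plain numbers:

# Sketch only: store dates as integer Unix timestamps so the Chroma `where`
# filter compares plain numbers. Field names here are hypothetical.
import datetime

from langchain.chains.query_constructor.base import AttributeInfo


def to_timestamp(date_str: str) -> int:
    """Convert an ISO date string such as '2023-01-01' to integer Unix seconds."""
    return int(datetime.datetime.fromisoformat(date_str).timestamp())


# Metadata written at ingestion time.
doc_metadata = {
    "source": "rates_2023.pdf",
    "date_published": to_timestamp("2023-01-01"),
}

# Describing the attribute as a numeric timestamp nudges the query constructor
# toward emitting integer comparisons instead of date objects.
metadata_field_info = [
    AttributeInfo(
        name="date_published",
        description="Publication date as a Unix timestamp (integer seconds since epoch).",
        type="integer",
    ),
]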

dosubot[bot] commented 9 months ago

🤖

Thank you for providing a detailed description of the issue you're encountering. It seems like the SelfQueryRetriever is indeed converting your date string into a date object, which is causing the ValueError.

To better assist you, could you please provide the following additional information:

  1. Could you share the structure of your metadata_field_info? This would help in understanding how the metadata is being processed.
  2. Could you provide a sample of the data you're working with, especially the part that includes the date? This would help in reproducing the issue.
  3. Are there any specific settings or configurations you have applied to the SelfQueryRetriever or the ConversationalRetrievalChain?

This additional information will help me understand the context better and provide a more accurate solution.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

XariZaru commented 9 months ago

Here is the following metadata_field_info.

    metadata_field_info = [
        AttributeInfo(
            name="source",
            description="The document this chunk is from.",
            type="string",
        ),
        AttributeInfo(
            name="origin",
            description="The origin the document came from. Comes from either scraped websites like TheKinection.org, Kinecta.org or database files like Bancworks. Bancworks is the higher priority.",
            type="string",
        ),
        AttributeInfo(
            name="date_day",
            description="The day the document was uploaded.",
            type="integer",
        )
    ]

I am not sure which sample of data to provide. It looks like the SelfQueryRetriever is splitting up my question, which includes a date; I don't think the date itself appears in my data, but I could be wrong.

The only configuration I have for the SelfQueryRetriever or ConversationalRetrievalChain is what I've already posted in the original snippet.
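
One way to see exactly what filter the LLM is generating, without running the whole chain, is to invoke the query-constructor runnable on its own. A sketch, under the assumption that llm and metadata_field_info are the same objects used in the snippets above:

# Hedged debugging sketch: build only the query-constructor chain and inspect
# the StructuredQuery (query string + filter) it produces for a question.
from langchain.chains.query_constructor.base import load_query_constructor_runnable

query_constructor = load_query_constructor_runnable(
    llm,
    "Information about when document was published and where it originated from",
    metadata_field_info,
)
structured_query = query_constructor.invoke(
    {"query": "What is the minimum opening deposit for each account as of January 2023?"}
)
print(structured_query)  # shows the generated filter before it reaches the vector store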

@dosubot

XariZaru commented 9 months ago

I have an updated error now. After including more AttributeInfo metadata fields, the retriever seems to split the question up better; however, there is now a parsing error.

    metadata_field_info = [
        AttributeInfo(
            name="source",
            description="The document this chunk is from.",
            type="string",
        ),
        AttributeInfo(
            name="origin",
            description="The origin the document came from. Comes from either scraped websites like TheKinection.org, Kinecta.org or database files like Bancworks. Bancworks is the higher priority.",
            type="string",
        ),
        AttributeInfo(
            name="date_day",
            description="The day the document was uploaded.",
            type="int",
        ),
        AttributeInfo(
            name="date_uploaded",
            description="The month year the document is current to.",
            type="str",
        ),
        AttributeInfo(
            name="date_month",
            description="The month the document was uploaded.",
            type="int",
        ),
        AttributeInfo(
            name="date_month_name",
            description="The month name the document was uploaded.",
            type="string",
        ),
        AttributeInfo(
            name="date_year_long",
            description="The full year the document was uploaded.",
            type="int",
        ),
        AttributeInfo(
            name="date_year_short",
            description="The short year the document was uploaded.",
            type="int",
        ),
    ]
{"error": "OutputParserException: Parsing text\n```json\n{\n    \"query\": \"minimum opening deposit\",\n    \"filter\": \"eq(\\\"date_month_name\\\", \\\"January\\\") and eq(\\\"date_year_long\\\", 2023)\"\n}\n```\n raised following error:\nUnexpected token Token('CNAME', 'and') at line 1, column 34.\nExpected one of: \n\t* $END\nPrevious tokens: [Token('RPAR', ')')]\n"}

@dosubot

dosubot[bot] commented 9 months ago

🤖

Based on the error message you provided, it seems like the issue is with the filter query you're using. The error message indicates that the parser is encountering an unexpected token 'and' at line 1, column 34. This suggests that the parser is not able to understand the 'and' operator in your filter query.

The filter query you're using is: "eq(\"date_month_name\", \"January\") and eq(\"date_year_long\", 2023)". This filter query is trying to filter the documents where the date_month_name is 'January' and the date_year_long is 2023.

However, based on the error message, it seems like the parser is not able to understand the 'and' operator. This could be due to the fact that the parser is not configured to understand the 'and' operator, or it could be due to a syntax error in your filter query.

One way to resolve this issue is to check the documentation of the LangChain framework to see how to use the 'and' operator in filter queries. If the 'and' operator is not supported, you might need to use a different operator or method to achieve the same result.

Another way to resolve this issue is to check the syntax of your filter query. Make sure that the filter query is correctly formatted and that all the necessary parentheses and quotation marks are in place.
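
For what it's worth, the self-query filter grammar expects and/or as prefix functions rather than infix keywords, i.e. and(eq("date_month_name", "January"), eq("date_year_long", 2023)); the infix form in the error above is what the parser chokes on. For illustration only, the equivalent filter expressed with the query-constructor IR classes:

# Illustration only: the filter the parser expects, built with the internal
# representation instead of relying on the LLM's raw text output.
from langchain.chains.query_constructor.ir import (
    Comparator,
    Comparison,
    Operation,
    Operator,
)

expected_filter = Operation(
    operator=Operator.AND,
    arguments=[
        Comparison(comparator=Comparator.EQ, attribute="date_month_name", value="January"),
        Comparison(comparator=Comparator.EQ, attribute="date_year_long", value=2023),
    ],
)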

If you're still encountering the issue after checking the documentation and the syntax of your filter query, it might be helpful to look at similar issues in the LangChain repository. I found a similar issue in the LangChain repository: langchain.schema.OutputParserException: Could not parse LLM output: `. This issue suggests a workaround for the OutputParserException error, which involves catching the exception and removing the prefix and suffix from the error message. This might not directly solve your issue, but it could give you some ideas on how to handle the error.

Here's the suggested solution from the issue:

@serving
def ask(input: str) -> str:
    print("-- Serving request for input: %s" % input)
    try:
        response = agent_executor.run(input)
    except Exception as e:
        response = str(e)
        if response.startswith("Could not parse LLM output: `"):
            response = response.removeprefix("Could not parse LLM output: `").removesuffix("`")
    return response

I hope this helps! If you have any other questions or need further clarification, feel free to ask.


XariZaru commented 9 months ago

Sorry, I'm still not sure what the best approach for this is. The query seems simple enough, and based on the examples demonstrated on the SelfQueryRetriever page, it should be handled just fine. @dosu-bot

dosubot[bot] commented 9 months ago

🤖

Based on the error message you provided, it seems like the 'and' operator is not recognized by the parser. This could be due to the fact that the 'and' operator is not included in the allowed_operators for the structured_query_translator being used.

In the from_llm class method of SelfQueryRetriever, the allowed_operators for the structured_query_translator are set as follows:

if (
    "allowed_operators" not in chain_kwargs
    and structured_query_translator.allowed_operators is not None
):
    chain_kwargs[
        "allowed_operators"
    ] = structured_query_translator.allowed_operators

If the 'and' operator is not included in the allowed_operators of the structured_query_translator, it will not be allowed in the filter query.

To resolve this issue, you can include the 'and' operator in the allowed_operators of the structured_query_translator when you're creating an instance of SelfQueryRetriever. Here's an example of how you can do this:

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_contents,
    metadata_field_info,
    structured_query_translator=structured_query_translator,
    chain_kwargs={
        "allowed_operators": ["and", "or", "<", ">", "=", "<=", ">=", "!="]  # Include 'and' and any other operators you need
    },
    enable_limit=enable_limit,
    use_original_query=use_original_query,
    **kwargs,
)

Please replace llm, vectorstore, document_contents, metadata_field_info, structured_query_translator, enable_limit, use_original_query, and kwargs with your actual values.
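
As an aside (an assumption, not something confirmed in this thread): with Chroma as the vector store, the structured_query_translator placeholder above would normally be the built-in Chroma translator, which from_llm also selects automatically when none is passed:

# Assumption: Chroma is the vector store, so its built-in translator applies.
from langchain.retrievers.self_query.chroma import ChromaTranslator

structured_query_translator = ChromaTranslator()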

You can find more details about this in the LangChain repository.

Regarding your question about whether the LangChain framework supports the 'and' operator in the SelfQueryRetriever filter query by default, it does. This is evident from the RedisFilterExpression class, which has a method __and__ that combines two RedisFilterExpression instances using the 'and' operator. You can find more details about this in the LangChain repository.

I hope this helps! If you have any other questions, feel free to ask.


XariZaru commented 9 months ago

Even after setting allowed_operators in chain_kwargs, I am still getting the same issue. Can I get some extra help on the matter?

@dosu-bot

chain_kwargs={
    "allowed_operators": ["and", "or", "<", ">", "=", "<=", ">=", "!="]  # Include 'and' and any other operators you need
},

dosubot[bot] commented 9 months ago

🤖

Based on the information you've provided, it seems like the issue might be related to how the 'and' operator is being handled in the OpenSearchTranslator class of the LangChain framework.

In the visit_operation method, which is responsible for handling operations (including logical operations like 'and', 'or', 'not'), the _format_func method should be called for the operator. This method is responsible for translating the operators and comparators from the internal domain-specific language to the OpenSearch query language. In the case of the 'and' operator, it should be translated to 'must'.

However, it appears that _format_func is not being called for the operator. Instead, the operator is being directly used in the OpenSearch query, which is why the 'and' operator is not being recognized.

Here is the problematic code:

def visit_operation(self, operation: Operation) -> Dict:
    args = [arg.accept(self) for arg in operation.arguments]

    return {"bool": {self._format_func(operation.operator): args}}

To fix this issue, you should ensure that _format_func is called for the operator in visit_operation. The corrected code should look like this:

def visit_operation(self, operation: Operation) -> Dict:
    args = [arg.accept(self) for arg in operation.arguments]
    operator = self._format_func(operation.operator)

    return {"bool": {operator: args}}

You can find this code in the libs/langchain/langchain/retrievers/self_query/opensearch.py file in the LangChain repository.

As for the allowed_operators list, it is used in the LangChain framework to specify the operators that are permitted when constructing a query. This is particularly relevant in the context of the SelfQueryRetriever and structured_query_translator. If the allowed_operators is not provided in the chain_kwargs argument, it checks if the allowed_operators attribute is available in the structured_query_translator. If it is, it sets the allowed_operators in chain_kwargs to the allowed_operators from the structured_query_translator.

This allowed_operators list is then passed to the load_query_constructor_runnable function, which presumably uses it to construct a query. You can find this code in the libs/langchain/langchain/retrievers/self_query/base.py file in the LangChain repository.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.


XariZaru commented 9 months ago

@dosu-bot Can I get additional help on this from someone else? I don't want to go down a rabbit hole just in case.

dosubot[bot] commented 9 months ago

It seems like this issue requires a deeper look, @baskaryan could you please lend your expertise?

HuitengXiao commented 4 months ago

How do you solve this problem? I have the same situation...