Closed XariZaru closed 5 months ago
🤖
Hello @XariZaru! Good to see you back in the LangChain repository. Hope your day is going well!
Based on the information you've provided, it seems like the issue might be related to how the SelfQueryRetriever is handling queries with multiple metadata flags. The "AND" operator in the SelfQueryRetriever class is used to combine multiple conditions in a query. When used, all conditions must be met for a record to be returned.
In the context of the ElasticsearchTranslator, the "AND" operator is translated into a "must" clause within a "bool" query in Elasticsearch. This means that all conditions specified in the "must" clause must be satisfied for a document to match the query.
Here is a relevant code snippet from the LangChain repository:
def test_visit_operation() -> None:
    op = Operation(
        operator=Operator.AND,
        arguments=[
            Comparison(comparator=Comparator.EQ, attribute="foo", value=2),
            Comparison(comparator=Comparator.EQ, attribute="bar", value="baz"),
        ],
    )
    expected = {
        "bool": {
            "must": [
                {"term": {"metadata.foo": 2}},
                {"term": {"metadata.bar.keyword": "baz"}},
            ]
        }
    }
    actual = DEFAULT_TRANSLATOR.visit_operation(op)
    assert expected == actual
In this test case, an Operation object is created with the "AND" operator and two Comparison objects as arguments. The expected output is a dictionary representing an Elasticsearch query where both conditions (metadata.foo equals 2 and metadata.bar.keyword equals "baz") must be met. The actual output from the visit_operation method of the ElasticsearchTranslator is then compared to this expected output. You can find this in the libs/langchain/tests/unit_tests/retrievers/self_query/test_elasticsearch.py file.
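As a rough illustration of the translation the test checks, here is a minimal standard-library sketch. The Eq and And classes and the to_es function are hypothetical stand-ins, not LangChain's Comparison, Operation, or ElasticsearchTranslator, and the ".keyword" suffix for string values is only inferred from the expected output in the test above.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Eq:
    """Hypothetical stand-in for an equality comparison on a metadata field."""
    attribute: str
    value: Any


@dataclass
class And:
    """Hypothetical stand-in for a logical AND over several statements."""
    arguments: List[Any]


def to_es(node: Any) -> dict:
    """Translate the toy query tree into an Elasticsearch-style query dict."""
    if isinstance(node, And):
        # Every translated argument goes into "must", so all conditions have to hold.
        return {"bool": {"must": [to_es(arg) for arg in node.arguments]}}
    if isinstance(node, Eq):
        field = f"metadata.{node.attribute}"
        # Assumption taken from the test's expected output: string values
        # are matched against the ".keyword" sub-field.
        if isinstance(node.value, str):
            field += ".keyword"
        return {"term": {field: node.value}}
    raise TypeError(f"Unsupported node: {node!r}")


query = to_es(And([Eq("foo", 2), Eq("bar", "baz")]))
print(query)
```

The resulting dictionary has the same shape as the expected value in the test: one "term" clause per condition, all nested under "must".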
If you're still encountering the issue, could you please provide the exact error message you're receiving? This will help in diagnosing the problem more accurately.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
When the LLM generates the query, it seems to misunderstand how the "and" operator is used. The prompt for query generation states the following:
A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to
but the LLM returns the following query:
{
    "query": "minimum opening deposits",
    "filter": "eq(\"date_month_name\", \"January\") and eq(\"date_year_long\", \"2023\")"
}
I think the correct query would be:
{
    "query": "minimum opening deposits",
    "filter": "and(eq(\"date_month_name\", \"January\"), eq(\"date_year_long\", \"2023\"))"
}
I think this boils down to the LLM getting confused. The solution might be to update the prompt with more examples.
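To make the difference between the two filter strings concrete, here is a minimal sketch of a post-processing step that rewrites an infix filter into the prefix form the prompt describes. The helpers split_top_level and infix_to_prefix are hypothetical and not part of LangChain. The sketch only understands the keywords "and" and "or", parentheses, and double-quoted strings, so treat it as a starting point rather than a fix.

```python
from typing import List


def split_top_level(expr: str, keyword: str) -> List[str]:
    """Split expr on ' keyword ' found outside parentheses and double quotes."""
    token = f" {keyword} "
    parts: List[str] = []
    depth, in_string, start, i = 0, False, 0, 0
    while i < len(expr):
        ch = expr[i]
        if ch == "\\" and in_string:
            i += 2  # skip an escaped character inside a string literal
            continue
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and expr.startswith(token, i):
                parts.append(expr[start:i].strip())
                i += len(token)
                start = i
                continue
        i += 1
    parts.append(expr[start:].strip())
    return parts


def infix_to_prefix(expr: str) -> str:
    """Rewrite 'a and b' as 'and(a, b)'; leave prefix-form filters unchanged."""
    # "or" is tried first so that it binds more loosely than "and".
    for keyword in ("or", "and"):
        parts = split_top_level(expr, keyword)
        if len(parts) > 1:
            return f"{keyword}(" + ", ".join(infix_to_prefix(p) for p in parts) + ")"
    return expr


malformed = 'eq("date_month_name", "January") and eq("date_year_long", "2023")'
print(infix_to_prefix(malformed))
```

A filter that is already in prefix form, or that contains the word "and" only inside a quoted value, passes through unchanged.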
Thank you for taking the time to elaborate on the issue. I'm still acclimating to LangChain and LLMs in particular. I do have a prompt template that I am using right now. Would your suggestion change if I aim to use a ConversationalRetrievalChain?
"""
### Instruction: You're an assistant at KFCU who knows the following information:
### {context}
If you don't know the answer, then say you don't know and refer the user to the respective department for extra information.
Absolutely do not mention you are an AI language model. Use only the chat history and the following information.
### {chat_history}
### Input: {question}
### Response:
"""
I looked through the documentation and saw some examples as such:
example_template = """Here's an example of an interaction:
Q: {example_q}
A: {example_a}"""
example_prompt = PromptTemplate.from_template(example_template)
If I have many styles of prompts how would that work given my current prompt?
Sorry for being unclear. During the get_relevant_documents call, the SelfQueryRetriever prompts the LLM to create a structured query to get documents using the metadata field info. The LLM returns a malformed query. That's what I'm referring to in the previous comment.
@sweetkane Oh, I see. Are you talking about the LCEL section of the following page? Self Query
Edit: I found it! Also, how would you recommend writing an example where I ask the SelfQueryRetriever to fetch the latest version of an uploaded document (given that I have the upload date in the metadata)?
I have done the following:
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)
from langchain.chains.query_constructor.base import load_query_constructor_runnable
examples = [
    (
        "What are the minimum opening deposits for a rate sheet current for January 2023?",
        {
            "query": "rate sheet, minimum opening deposit",
            "filter": 'and(eq("date_month_name", "January"), eq("date_year_long", 2023))',
        },
    ),
    (
        "What is the minimum opening deposit for an IRA Money Market account for March 2021?",
        {
            "query": "minimum opening deposit, IRA Money Market account",
            "filter": 'and(eq("date_month_name", "March"), eq("date_year_long", 2021))',
        },
    ),
    # (
    #     "What is today's ",
    #     {
    #         "query": "minimum opening deposit, IRA Money Market account",
    #         "filter": 'and(eq("date_month_name", "March"), eq("date_year_long", 2021))',
    #     },
    # ),
]
llm = ChatOpenAI(temperature=0)
doc_contents = "Information about various documents, the date they are up to date with and where they were sourced from."
chain = load_query_constructor_runnable(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0), doc_contents, metadata_field_info, examples=examples  # , fix_invalid=True
)
print("Chain invoke")
print(chain.invoke({"query": "I want to see the rate sheet for January 2023"}))
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains import RetrievalQA
question = "Give the minimum opening deposits for a rate sheet in January 2023"
vectorstore = Pinecone.from_existing_index(index_name="test", embedding=get_embedding_function())
# print("Load existing vector store")
retriever = SelfQueryRetriever(
    query_constructor=chain, vectorstore=vectorstore, verbose=True
)
retriever.get_relevant_documents(question)
The retriever now returns no documents. It doesn't throw an error at least. However, based on metadata, I think it should be able to grab the documents. Here is an example document.
(Document(page_content='File Name: 28363 Rate Sheet Update 040123.pdf, Document Name: 2023 Rate Sheets, Date: 03/31/23\n\n28363-03/28/2023\n\nFEDERALLY INSURED BY NCUA\n\nEffective Date\n\nApril 1, 2023\n\nConsumer Dividend Rate Sheet\n\n(Rates are subject to change without notice.)\n\nACCOUNT TYPE\n\nMINIMUM OPENING\n\nDEPOSIT\n\nMINIMUM BALANCE\n\nTO EARN STATED APY\n\nDIVIDEND\n\nRATE\n\nAPY1\n\nRegular Share/Savings Account\n\n$5\n\n$5\n\n0.05%\n\n0.05%', metadata={'date_day': '31', 'date_month': '3', 'date_month_name': 'March', 'date_uploaded': datetime.date(2023, 3, 12), 'date_year_long': '2023', 'date_year_short': '23', 'doc_id': 'fd966b38-ebee-4da0-919d-2f3fafa28910', 'origin': 'bancworks', 'source': 'storage/Bancworks/28363 Rate Sheet Update 040123.pdf.txt'}),
0.859830618),
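One thing that may be worth checking, offered as a possibility rather than a confirmed cause: the few-shot examples above pass date_year_long as the integer 2023, while the stored metadata holds the string '2023', and the sample document is dated March rather than January. The sketch below uses a hypothetical matches helper that compares values strictly by type and value. Whether a particular vector store compares types this strictly is an assumption here, so verify it against the store's own filtering rules.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], conditions: Dict[str, Any]) -> bool:
    """Return True only if every condition equals the stored value, type included."""
    return all(
        key in metadata
        and type(metadata[key]) is type(value)
        and metadata[key] == value
        for key, value in conditions.items()
    )


# Subset of the metadata from the sample document above.
doc_metadata = {"date_month_name": "March", "date_year_long": "2023"}

# The question asks for January, but this document is dated March.
january_filter = {"date_month_name": "January", "date_year_long": "2023"}
# Right month, but the year is an integer as in the few-shot examples.
int_year_filter = {"date_month_name": "March", "date_year_long": 2023}
# Right month and a string year, as stored in the metadata.
str_year_filter = {"date_month_name": "March", "date_year_long": "2023"}

for name, conditions in [
    ("january", january_filter),
    ("int year", int_year_filter),
    ("str year", str_year_filter),
]:
    print(name, matches(doc_metadata, conditions))
```

Under these assumptions only the last filter matches, which suggests keeping the value types in the few-shot examples aligned with the types stored in the metadata.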
@dosu-bot
🤖
Thank you for providing such a detailed description of the issue you're facing. It seems like the problem lies in the way the SelfQueryRetriever is handling complex queries with multiple metadata flags.
The SelfQueryRetriever in LangChain uses a combination of a query constructor and a structured query translator to handle complex queries. The query constructor generates a structured query from the input query string, which includes the metadata flags. This structured query is then passed to the structured query translator, which translates it into parameters that can be used to search the vector store.
The error you're encountering might be due to the way the "AND" operator is being handled in the structured query translator. If the translator is not correctly interpreting the "AND" operator, it could lead to the CNAME error you're seeing.
Unfortunately, without access to the source code of the SelfQueryRetriever and the structured query translator, it's difficult to pinpoint the exact cause of the issue. However, you might want to check the implementation of the visit_structured_query method in the structured query translator. This method is responsible for translating the structured query into a new query string and a dictionary of keyword arguments. If the "AND" operator is not being correctly translated, it could lead to the error you're seeing.
Here is the relevant code from the SelfQueryRetriever:
def _prepare_query(
    self, query: str, structured_query: StructuredQuery
) -> Tuple[str, Dict[str, Any]]:
    new_query, new_kwargs = self.structured_query_translator.visit_structured_query(
        structured_query
    )
    if structured_query.limit is not None:
        new_kwargs["k"] = structured_query.limit
    if self.use_original_query:
        new_query = query
    search_kwargs = {**self.search_kwargs, **new_kwargs}
    return new_query, search_kwargs
This method prepares the query for execution by calling the visit_structured_query method of the structured query translator. If the "AND" operator is not being correctly handled in this method, it could lead to the error you're seeing.
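The merging logic in the snippet above can be tried in isolation. The following is a minimal sketch with a hypothetical prepare_query function that takes the translator's output as plain arguments instead of calling a real translator. It only illustrates how the limit and the keyword-argument precedence behave, and the filter dictionary is a made-up placeholder.

```python
from typing import Any, Dict, Optional, Tuple


def prepare_query(
    original_query: str,
    rewritten_query: str,
    translated_kwargs: Dict[str, Any],
    base_search_kwargs: Dict[str, Any],
    limit: Optional[int] = None,
    use_original_query: bool = False,
) -> Tuple[str, Dict[str, Any]]:
    """Mimic the merge step: translated kwargs override the retriever defaults."""
    new_kwargs = dict(translated_kwargs)  # copy so the caller's dict is untouched
    if limit is not None:
        new_kwargs["k"] = limit  # a limit in the structured query becomes "k"
    new_query = original_query if use_original_query else rewritten_query
    # Later keys win, so values from the structured query replace the defaults.
    return new_query, {**base_search_kwargs, **new_kwargs}


query, kwargs = prepare_query(
    original_query="Give the minimum opening deposits for a rate sheet in January 2023",
    rewritten_query="minimum opening deposits",
    translated_kwargs={"filter": {"placeholder": "translated filter goes here"}},
    base_search_kwargs={"k": 4},
    limit=2,
)
print(query, kwargs)
```

Here the default k of 4 is replaced by the limit of 2, and the rewritten query is used because use_original_query is left at False.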
I hope this helps! If you have any more information or if there's anything else I can do to assist, please let me know.
I have no idea what happened. I stopped my work on Friday and ran the code today without touching anything and it returned documents. Previously on Friday, running the SelfQueryRetriever yielded no documents even though the metadata filtering returned something.
Further to XariZaru's post, you can pass the examples directly to the from_llm class method, which helps mitigate the issue but doesn't solve it entirely:
sq_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    document_contents=document_content_description,
    metadata_field_info=METADATA_FIELD_INFO,
    chain_kwargs={
        "examples": [
            (
                "According to ABC_123, what is foobar?",
                {
                    "query": "foobar",
                    "filter": 'eq("id", "ABC_123")',
                },
            ),
        ]
    },
)
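Since hand-written filter strings are easy to get wrong, one option is to generate the example tuples programmatically so the prefix form stays consistent. This is a minimal sketch. The helpers eq, and_, and make_example are hypothetical and not part of LangChain, and json.dumps is used only as a convenient way to quote strings while leaving numbers bare.

```python
import json
from typing import Any, Dict, Tuple


def eq(attribute: str, value: Any) -> str:
    """Render one comparison, e.g. eq("id", "ABC_123")."""
    return f"eq({json.dumps(attribute)}, {json.dumps(value)})"


def and_(*statements: str) -> str:
    """Wrap two or more statements in prefix form; pass a single one through."""
    if not statements:
        raise ValueError("at least one statement is required")
    if len(statements) == 1:
        return statements[0]
    return "and(" + ", ".join(statements) + ")"


def make_example(question: str, query: str, **metadata: Any) -> Tuple[str, Dict[str, str]]:
    """Build a (question, {"query": ..., "filter": ...}) few-shot example."""
    filter_str = and_(*(eq(key, value) for key, value in metadata.items()))
    return question, {"query": query, "filter": filter_str}


single = make_example("According to ABC_123, what is foobar?", "foobar", id="ABC_123")
combined = make_example(
    "What are the minimum opening deposits for a rate sheet current for January 2023?",
    "rate sheet, minimum opening deposit",
    date_month_name="January",
    date_year_long="2023",
)
print(single)
print(combined)
```

The resulting tuples have the same shape as the entries passed under "examples" above, so a list of them could be supplied there in the same way.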
@XariZaru Did it get resolved?
I'm still facing issues, even after passing examples that use "and".
Checked other resources
Example Code
Description
When I ask to fetch relevant documents with the following query:
There is no problem. However, if I make this query a little more robust...
I get a CNAME "and" error. This happens in both Pinecone and ChromaDB. Something is wrong with how the query translator is operating or I am missing some crucial step. We should be able to use multiple metadata flags at once.
System Info
Python 3.11, LangChain 0.1.0, Chroma 0.4.22, Pinecone 2.2.4
Related Components