Filters don't work with Azure Search Vector Store retriever

levalencia commented 11 months ago

System Info

langchain 0.0.342 langchain-core 0.0.7 azure-search-documents 11.4.0b8

Python: 3.10

Who can help?

@hw

Information

[ ] The official example notebooks/scripts
[ ] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[ ] Document Loaders
[ ] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

The following code works fine:

from langchain.vectorstores.azuresearch import AzureSearch
from langchain_core.vectorstores import VectorStore, VectorStoreRetriever

index_name: str = "langchain-vector-demo"
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name="vector-1701341754619",
    embedding_function=embeddings.embed_query
)

res = vector_store.similarity_search(
    query="Can Colleagues contact their managers?", k=20, search_type="hybrid",  filters="title eq '25_1_0.pdf'")

The res object contains ONLY the chunks whose title is '25_1_0.pdf'.

However, when using it with an LLM:

from langchain.chains import RetrievalQA
from langchain.chat_models import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_deployment="chat",
    openai_api_version="2023-05-15",
)

retriever = vector_store.as_retriever(search_type="similarity", filters="title eq '25_1_0.pdf'", kwargs={"k": 3})
chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=retriever,
                                    return_source_documents=True)

result = chain({"query": 'Can Colleagues contact their managers??'})

for res in result['source_documents']:
    print(res.metadata['title'])

My output has chunks which don't respect the filter:

142_2_0.pdf
99_9_0.docx
99_9_0.docx
142_2_0.pdf

Expected behavior

The source_documents returned with the generated answer should contain only chunks that respect the given filters.

dosubot[bot] commented 11 months ago

🤖

Hello @levalencia! It's great to see you again. I hope you've been well. Let's dive into this issue you're experiencing with the LangChain framework.

Based on the information you've provided, it seems like the filters parameter is not being applied when using the AzureChatOpenAI with the RetrievalQA chain. This could be due to a number of reasons, and I'll try to provide some potential solutions based on similar issues that have been solved in the past.

One possible solution is to extend the VectorStoreRetriever class and override the get_relevant_documents method to apply the filter manually. Here is an example of how you can do this:

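from typing import Any, List

from langchain_core.documents import Document
from langchain_core.vectorstores import VectorStoreRetriever

class FilteredRetriever(VectorStoreRetriever):
    # Accept and forward extra keyword arguments (such as callbacks) that chains pass in.
    def get_relevant_documents(self, query: str, **kwargs: Any) -> List[Document]:
        results = super().get_relevant_documents(query, **kwargs)
        # Keep only the chunks whose title metadata matches the wanted file.
        return [doc for doc in results if doc.metadata.get('title') == '25_1_0.pdf']

retriever = FilteredRetriever(vectorstore=vector_store)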

Then, you can use this FilteredRetriever in your RetrievalQA chain:

chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=retriever,
                                    return_source_documents=True)

This solution was suggested in a similar issue: Filtering retrieval with ConversationalRetrievalChain.
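
One caveat with filtering after retrieval, sketched below with hypothetical in-memory data (the `RANKED_HITS` list and the helper functions are illustrative only and make no LangChain or Azure calls): the retriever first returns its top k hits and the filter then discards the non-matching ones, so fewer than k documents, possibly none, may remain. A filter applied before ranking does not have this problem.

```python
from typing import Dict, List

# Hypothetical ranked hits, best match first; titles are borrowed from the issue.
RANKED_HITS: List[Dict[str, str]] = [
    {"title": "142_2_0.pdf", "text": "chunk a"},
    {"title": "99_9_0.docx", "text": "chunk b"},
    {"title": "25_1_0.pdf", "text": "chunk c"},
    {"title": "25_1_0.pdf", "text": "chunk d"},
]


def top_k(hits: List[Dict[str, str]], k: int) -> List[Dict[str, str]]:
    """Stand-in for an unfiltered similarity search returning the k best hits."""
    return hits[:k]


def keep_title(hits: List[Dict[str, str]], title: str) -> List[Dict[str, str]]:
    """Keep only the hits whose title matches."""
    return [h for h in hits if h["title"] == title]


# Client-side: take the top 2 first, then filter (the FilteredRetriever approach).
client_side = keep_title(top_k(RANKED_HITS, 2), "25_1_0.pdf")

# Server-side stand-in: filter first, then take the top 2.
server_side = top_k(keep_title(RANKED_HITS, "25_1_0.pdf"), 2)

print(len(client_side), len(server_side))  # prints "0 2"
```

In this toy data the two best hits belong to other files, so the client-side variant returns nothing even though matching chunks exist.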

Another possible solution is to ensure that the filters parameter is being passed correctly to the similarity_search method. In the AzureSearch class, the filters parameter is passed to the self.client.search method of the Azure Search client. This allows the search to be filtered according to the conditions specified in the filters parameter. You can check the source code for this in the libs/langchain/langchain/vectorstores/azuresearch.py file.

If these solutions don't work, it would be helpful to know more about the error or unexpected behavior you're experiencing. For example, are you receiving any error messages? Or are the search results simply not what you're expecting? Also, it would be helpful to know the exact version of LangChain you're using, as well as the version of the Azure Search SDK. This information could help identify if there are any known issues or changes in the recent versions that might be causing this behavior.

I hope this helps! If you have any other questions or need further clarification, please let me know.

Gauravmahapatrocdm commented 6 months ago

Hey, I am also looking for a similar solution. Did you find any solution for this?

lkartawidjaja commented 6 months ago

@Gauravmahapatrocdm, I was able to get the filter working by passing it in the search keyword arguments.

In the above example, I got it working by using the following: retriever = vector_store.as_retriever(search_kwargs={"k":3, filters="title eq '25_1_0.pdf'"})

HeroAntagonist commented 6 months ago

retriever = vector_store.as_retriever(search_kwargs={"k":3, filters="title eq '25_1_0.pdf'"})

@lkartawidjaja How did this work if search_kwargs only accepts a dict and filters=... is not valid dict syntax?

lkartawidjaja commented 6 months ago

@HeroAntagonist, my bad, it should be:

retriever = vector_store.as_retriever(search_kwargs={"k": 3, "filters": "title eq '25_1_0.pdf'"})
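
Why the `search_kwargs` form works while the original call did not can be illustrated with a small stand-alone model. `ToyStore` and `ToyRetriever` below are hypothetical stand-ins, not the real `AzureSearch` or `VectorStoreRetriever` classes. The sketch assumes that the retriever forwards only the contents of `search_kwargs` to `similarity_search` and silently drops any other keyword argument; that matches the behaviour reported in this issue but is an assumption about the library internals.

```python
from typing import Any, Dict, List, Optional


class ToyRetriever:
    """Hypothetical retriever: only search_kwargs reaches the search call."""

    def __init__(self, store: "ToyStore",
                 search_kwargs: Optional[Dict[str, Any]] = None,
                 **ignored: Any) -> None:
        self.store = store
        self.search_kwargs = search_kwargs or {}
        # Assumption: anything in **ignored (e.g. filters=..., kwargs=...) is dropped.

    def get_relevant_documents(self, query: str) -> List[Dict[str, str]]:
        return self.store.similarity_search(query, **self.search_kwargs)


class ToyStore:
    """Hypothetical vector store that understands one filter form only."""

    def __init__(self, docs: List[Dict[str, str]]) -> None:
        self.docs = docs

    def similarity_search(self, query: str, k: int = 4,
                          filters: Optional[str] = None) -> List[Dict[str, str]]:
        hits = self.docs
        if filters is not None:
            # Toy parser for the single form "title eq '<value>'".
            wanted = filters.split("'")[1]
            hits = [d for d in hits if d["title"] == wanted]
        return hits[:k]

    def as_retriever(self, **kwargs: Any) -> ToyRetriever:
        return ToyRetriever(self, **kwargs)


store = ToyStore([{"title": "142_2_0.pdf"},
                  {"title": "25_1_0.pdf"},
                  {"title": "99_9_0.docx"}])

# Call shape from the original report: the filter never reaches the search.
ignored = store.as_retriever(filters="title eq '25_1_0.pdf'", kwargs={"k": 3})
# Corrected call shape: the filter travels inside search_kwargs.
applied = store.as_retriever(search_kwargs={"k": 3, "filters": "title eq '25_1_0.pdf'"})

titles_ignored = [d["title"] for d in ignored.get_relevant_documents("q")]
titles_applied = [d["title"] for d in applied.get_relevant_documents("q")]
print(titles_ignored)  # ['142_2_0.pdf', '25_1_0.pdf', '99_9_0.docx']
print(titles_applied)  # ['25_1_0.pdf']
```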
