Azure AI Search, metadata field is required and hardcoded in langchain community

levalencia commented 3 months ago

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Custom Retriever Code


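# Code from: https://redis.com/blog/build-ecommerce-chatbot-with-redis/
import os

from dotenv import load_dotenv
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_core.vectorstores import VectorStore
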
class UserRetriever(BaseRetriever):

    """
    UserRetriever extends BaseRetriever and is designed for retrieving relevant documents
    based on a user query using hybrid similarity search with a VectorStore.

    Attributes:
    - vectorstore (VectorStore): The VectorStore instance used for similarity search.
    - username (str): The username associated with the documents, used for personalized retrieval.

    Methods:
    - clean_metadata(self, doc): Cleans the metadata of a document, extracting relevant information for display.
    - get_relevant_documents(self, query): Retrieves relevant documents based on a user query using hybrid similarity search.

    Example:
    retriever = UserRetriever(vectorstore=vector_store, username="john_doe")
    relevant_docs = retriever.get_relevant_documents("How does photosynthesis work?")
    for doc in relevant_docs:
        print(doc.metadata["Title"], doc.page_content)
    """

    vectorstore: VectorStore
    username: str

    def clean_metadata(self, doc):
        """
        Cleans the metadata of a document.

        Parameters:
            doc (object): The document object.

        Returns:
            dict: A dictionary containing the cleaned metadata.

        """
        metadata = doc.metadata

        return {
            "file_id": metadata["title"], 
            "source": metadata["title"] + "_page=" + str(int(metadata["chunk_id"].split("_")[-1])+1), 
            "page_number": str(int(metadata["chunk_id"].split("_")[-1])+1), 
            "document_title": metadata["document_title_result"] 
        }

    def get_relevant_documents(self, query):
        """
        Retrieves relevant documents based on a given query.

        Args:
            query (str): The query to search for relevant documents.

        Returns:
            list: A list of relevant documents.

        """
        docs = []
        is_match_filter = ""
        load_dotenv()
        admins = os.getenv('ADMINS', '')
        admins_list = admins.split(',')
        is_admin = self.username.split('@')[0] in admins_list

        os.environ["AZURESEARCH_FIELDS_ID"] = "chunk_id"
        os.environ["AZURESEARCH_FIELDS_CONTENT"] = "chunk"
        os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "vector"
        # os.environ["AZURESEARCH_FIELDS_TAG"] = "metadata"

        if not is_admin:
            is_match_filter = f"search.ismatch('{self.username.split('@')[0]}', 'usernames_result')"

        for doc in self.vectorstore.similarity_search(query, search_type="semantic_hybrid", k=NUMBER_OF_CHUNKS_TO_RETURN, filters=is_match_filter):
            cleaned_metadata = self.clean_metadata(doc)
            docs.append(Document(
                page_content=doc.page_content,
                metadata=cleaned_metadata))

        print("\n\n----------------DOCUMENTS RETRIEVED------------------\n\n", docs)

        return docs

Setup of the LangChain chain and LLM


        chat = AzureChatOpenAI(
            azure_endpoint=SHD_AZURE_OPENAI_ENDPOINT,
            openai_api_version="2023-03-15-preview",
            deployment_name=POL_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
            openai_api_key=SHD_OPENAI_KEY,
            openai_api_type="Azure",
            model_name=POL_OPENAI_GPT_MODEL_NAME,
            streaming=True,
            callbacks=[ChainStreamHandler(g)],  # Set ChainStreamHandler as callback
            temperature=0)

        # Define system and human message prompts
        messages = [
            SystemMessagePromptTemplate.from_template(ANSWER_PROMPT),
            HumanMessagePromptTemplate.from_template("{question} Please answer in html format"),
        ]

        # Set up embeddings, vector store, chat prompt, retriever, memory, and chain
        embeddings = setup_embeddings()
        vector_store = setup_vector_store(embeddings)
        chat_prompt = ChatPromptTemplate.from_messages(messages)
        retriever = UserRetriever(vectorstore=vector_store, username=username)
        memory = setup_memory()
        #memory.save_context(chat_history)
        chain = ConversationalRetrievalChain.from_llm(chat, 
            retriever=retriever, 
            memory=memory, 
            verbose=False, 
            combine_docs_chain_kwargs={
                "prompt": chat_prompt, 
                "document_prompt": PromptTemplate(
                    template=DOCUMENT_PROMPT,
                    input_variables=["page_content", "source"]
                )
            }
        )

My fields

Error Message and Stack Trace (if applicable)

Exception has occurred: KeyError
'metadata'

The error is thrown on this line:

for doc in self.vectorstore.similarity_search(query, search_type="semantic_hybrid", k=NUMBER_OF_CHUNKS_TO_RETURN, filters=is_match_filter):

When I dug into the LangChain code, I found this:

docs = [
            (
                Document(
                    page_content=result.pop(FIELDS_CONTENT),
                    metadata={
                        **(
                            json.loads(result[FIELDS_METADATA])
                            if FIELDS_METADATA in result
                            else {
                                k: v
                                for k, v in result.items()
                                if k != FIELDS_CONTENT_VECTOR
                            }
                        ),
                        **{
                            "captions": {
                                "text": result.get("@search.captions", [{}])[0].text,
                                "highlights": result.get("@search.captions", [{}])[
                                    0
                                ].highlights,
                            }
                            if result.get("@search.captions")
                            else {},
                            "answers": semantic_answers_dict.get(
                                json.loads(result["metadata"]).get("key"),
                                "",
                            ),
                        },
                    },
                ),

As you can see near the end (the "answers" entry), it's trying to read a hardcoded "metadata" field from the search results, which we don't have because our index is customized with our own fields.

I am blaming this line: https://github.com/langchain-ai/langchain/blob/ced5e7bae790cd9ec4e5374f5d070d9f23d6457b/libs/community/langchain_community/vectorstores/azuresearch.py#L607
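
To make the failure concrete, here is a minimal sketch that mimics the snippet above with a plain dict instead of a real search result. The sample hit and the helper names are hypothetical, and the constants simply mirror the environment variables set earlier; this is not the library's code.

```python
import json

# These mirror the AZURESEARCH_FIELDS_* values from the report above.
FIELDS_CONTENT_VECTOR = "vector"
FIELDS_METADATA = "metadata"  # the field name the snippet expects to exist

# Hypothetical search hit from a custom index that has no "metadata" field.
result = {
    "chunk_id": "doc1_pages_0",
    "chunk": "some text",
    "vector": [0.1, 0.2],
    "title": "doc1",
}


def build_metadata(hit):
    """Guarded branch: falls back to the hit's own fields when there is no metadata field."""
    if FIELDS_METADATA in hit:
        return json.loads(hit[FIELDS_METADATA])
    return {k: v for k, v in hit.items() if k != FIELDS_CONTENT_VECTOR}


def answer_key_unguarded(hit):
    """Same shape as the 'answers' lookup: indexes "metadata" unconditionally."""
    return json.loads(hit["metadata"]).get("key")


def answer_key_guarded(hit):
    """One possible defensive variant (an assumption, not the actual fix)."""
    raw = hit.get(FIELDS_METADATA)
    return json.loads(raw).get("key") if raw else None


metadata = build_metadata(result)  # works: no metadata field is required here

try:
    answer_key_unguarded(result)
    raised = False
except KeyError:
    raised = True  # this is the KeyError: 'metadata' from the report
```

The first helper shows that the metadata dict itself is built safely; only the unconditional lookup in the "answers" entry fails for an index without that field.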

@Skar0, not sure if this is really a bug or if I missed something in the documentation.

Description

I am trying to use LangChain with Azure OpenAI and Azure Search as the vector store, together with a custom retriever. I don't have a metadata field in my index.

This was working in a previous project with azure-search-documents==11.4.b09, but in a new project I am trying azure-search-documents==11.4.0.

System Info

langchain==0.1.7
langchain-community==0.0.20
langchain-core==0.1.23
langchain-openai==0.0.6
langchainhub==0.1.14

Skar0 commented 3 months ago

Hello @levalencia 😃

I have taken a look at the code and did some tests with my own index, and it indeed seems like the error you are encountering is due to the following line: https://github.com/langchain-ai/langchain/blob/ced5e7bae790cd9ec4e5374f5d070d9f23d6457b/libs/community/langchain_community/vectorstores/azuresearch.py#L607

I have created a PR https://github.com/langchain-ai/langchain/pull/18938 with a bit more context on what the bug is, where it comes from, and how I (hopefully) fixed it. It would be nice if you can test and confirm!

thelazydogsback commented 3 months ago

I'm running into a similar issue as well. I also have multiple metadata fields in the index - langchain should not make the assumption that there is only one metadata field, nor hard-code any names. I expect something like this to work if all of the following fields are in my index:

Document( page_content = "this is the text",
    Title = "DocTitle",
    Category = "Foo",
    MoreMeta1 = {"x": 1, "y": 2},
    MoreMeta2 = {"z": 1, "q": 2},
)

However in my case all I'm trying to do is add my documents to the index with add_texts or add_documents, and this is when I receive:

The property 'metadata' does not exist on type 'search.documentFields'. Make sure to only use property names that are defined by the type

Should I open a new related issue for this?

thelazydogsback commented 3 months ago

The PR you reference is changing from 'metadata' to FIELDS_ID. I'm pretty new here, but shouldn't this be FIELDS_TAG?
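
To illustrate the difference, here is a rough sketch of an answer lookup keyed by the ID field. It assumes that semantic answers are stored under the key of the document they came from, which I have not verified, and all names and sample values are hypothetical. Judging by the commented-out line in the original report, AZURESEARCH_FIELDS_TAG only names the metadata field, so it would not identify a document.

```python
# Hypothetical sketch, not the library's code.
FIELDS_ID = "chunk_id"  # value of AZURESEARCH_FIELDS_ID in the report above

# Assumption: answers are indexed by the key of the document they were extracted from.
semantic_answers_dict = {
    "doc1_pages_0": {"text": "an extracted answer", "highlights": ""},
}


def answer_for(hit):
    """Look up the semantic answer for one search hit via its key field."""
    return semantic_answers_dict.get(hit.get(FIELDS_ID), "")


hit_with_answer = {"chunk_id": "doc1_pages_0", "chunk": "some text"}
hit_without_answer = {"chunk_id": "doc1_pages_1", "chunk": "other text"}
```

Under that assumption the lookup needs no "metadata" field at all, which would explain why the ID field is the one being used.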

Skar0 commented 3 months ago

> However in my case all I'm trying to do is add my documents to the index with add_texts or add_documents, and this is when I receive:
> The property 'metadata' does not exist on type 'search.documentFields'. Make sure to only use property names that are defined by the type
> Should I open a new related issue for this?

Do you create the index using the AzureSearch object? If so, I think a "metadata" field is created by default in the index definition. You can, however, decide to define an index yourself.

thelazydogsback commented 3 months ago

Thanks for the reply. No, I create the index in a separate pipeline outside of the python code. I don't have (nor want) one particular privileged field called "metadata" (nor only one field I can override in an env var) - there are several fields in the index which hold different types of metadata that I'd like to populate and search on separately.
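
For illustration, the behaviour I have in mind could look roughly like this. Everything here is hypothetical (the Doc class, the to_index_document helper, and the field names); it only sketches how each metadata entry would map to its own top-level index field instead of a single serialized "metadata" field.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_index_document(doc, doc_id, id_field="id", content_field="content",
                      metadata_fields=("Title", "Category", "MoreMeta1", "MoreMeta2")):
    """Build a flat record with one top-level field per configured metadata name."""
    record = {id_field: doc_id, content_field: doc.page_content}
    for name in metadata_fields:
        if name in doc.metadata:  # fields missing from the document are skipped
            record[name] = doc.metadata[name]
    return record


doc = Doc(
    page_content="this is the text",
    metadata={"Title": "DocTitle", "Category": "Foo", "MoreMeta1": {"x": 1, "y": 2}},
)
record = to_index_document(doc, doc_id="1")
```

With a mapping like this, the set of field names would be configuration rather than something hard-coded.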

paychex-ssmithrand commented 3 months ago

Also encountering this issue - and have the same set of requirements as @thelazydogsback
