langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Using AzureSearch with custom vector field names #14298

Open levalencia opened 11 months ago

levalencia commented 11 months ago

System Info

azure-search-documents==11.4.0b9 langchain 0.0.342 langchain-core 0.0.7

Who can help?

@hwc

Reproduction

My local.settings.json has the custom field names for Azure Cognitive Search:

{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AZURESEARCH_FIELDS_ID" :"chunk_id",
    "AZURESEARCH_FIELDS_CONTENT" :"chunk",
    "AZURESEARCH_FIELDS_CONTENT_VECTOR " :"vector",
    "AZURESEARCH_FIELDS_TAG" :"metadata",
    "FIELDS_ID" : "chunk_id",
    "FIELDS_CONTENT" : "chunk",
    "FIELDS_CONTENT_VECTOR" : "vector",
    "FIELDS_METADATA" : "metadata",
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "AzureWebJobsFeatureFlags": "EnableWorkerIndexing"
  }
}

I also tried to create a Fields array and pass it into the AzureSearch constructor like this:

            os.environ["AZURE_OPENAI_API_KEY"] = "xx"
            os.environ["AZURE_OPENAI_ENDPOINT"] = "https://xx.openai.azure.com/"
            embeddings = AzureOpenAIEmbeddings(
                azure_deployment="text-embedding-ada-002",
                openai_api_version="2023-05-15",
            )

            fields = [
                SimpleField(
                    name="chunk_id",
                    type=SearchFieldDataType.String,
                    key=True,
                    filterable=True,
                ),
                SearchableField(
                    name="chunk",
                    type=SearchFieldDataType.String,
                    searchable=True,
                ),
                SearchField(
                    name="vector",
                    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                    searchable=True,
                    vector_search_dimensions=1536,
                    vector_search_configuration="default",
                )
            ]

            FIELDS_ID = get_from_env(
                key="AZURESEARCH_FIELDS_ID", env_key="AZURESEARCH_FIELDS_ID", default="id"
            )
            FIELDS_CONTENT = get_from_env(
                key="AZURESEARCH_FIELDS_CONTENT",
                env_key="AZURESEARCH_FIELDS_CONTENT",
                default="content",
            )
            FIELDS_CONTENT_VECTOR = get_from_env(
                key="AZURESEARCH_FIELDS_CONTENT_VECTOR",
                env_key="AZURESEARCH_FIELDS_CONTENT_VECTOR",
                default="content_vector",
            )
            FIELDS_METADATA = get_from_env(
                key="AZURESEARCH_FIELDS_TAG", env_key="AZURESEARCH_FIELDS_TAG", default="metadata"
            )

            vector_store_address: str = "https://xx.search.windows.net"
            vector_store_password: str = "xx"
            vector_store: AzureSearch = AzureSearch(
                azure_search_endpoint=vector_store_address,
                azure_search_key=vector_store_password,
                index_name="vector-1701341754619",
                fiekds=fields,
                embedding_function=embeddings.embed_query
            )
            llm = AzureChatOpenAI(
                azure_deployment="chat",
                openai_api_version="2023-05-15",
            )  
            chain = RetrievalQA.from_chain_type(llm=llm,
                                                chain_type="stuff",
                                                retriever=Element61Retriever(vectorstore=vector_store),
                                                return_source_documents=True)
            result = chain({"query": 'Whats out of scope?'})
            return result

However I am always getting:

Executed 'Functions.TestCustomRetriever' (Failed, Id=2f243ed8-24bd-414b-af51-6cf1419633a5, Duration=6900ms)
[2023-12-05T15:08:53.252Z] System.Private.CoreLib: Exception while executing function: Functions.TestCustomRetriever. System.Private.CoreLib: Result: Failure
Exception: HttpResponseError: (InvalidRequestParameter) Unknown field 'content_vector' in vector field list.
Code: InvalidRequestParameter
Message: Unknown field 'content_vector' in vector field list.
Exception Details:      (UnknownField) Unknown field 'content_vector' in vector field list.
        Code: UnknownField

Please note this is being executed in an Azure Function locally

Expected behavior

The custom field names should be taken into account

dosubot[bot] commented 11 months ago

🤖

Hello again @levalencia! I see you've come across another issue. Let's take a look and see what we can do to help.

Based on the information you provided and the context from the LangChain repository, the issue appears to be that the environment variables are not set when AzureSearch reads them. The error message "Unknown field 'content_vector' in vector field list" suggests that AzureSearch is using the default field name "content_vector" instead of your custom field name.

In your code, you are setting the environment variables after importing the AzureSearch module. However, the environment variables should be set before importing the module. This is because the field names are fetched from the environment variables during the import process. If the environment variables are set after the import, the default field names will be used instead.

Here is the correct way to set the environment variables:

import os

os.environ["AZURESEARCH_FIELDS_ID"] = "chunk_id"
os.environ["AZURESEARCH_FIELDS_CONTENT"] = "chunk"
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "vector"
os.environ["AZURESEARCH_FIELDS_TAG"] = "metadata"

from langchain.vectorstores import AzureSearch

Replace "chunk_id", "chunk", "vector", and "metadata" with your actual custom field names.
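
To illustrate why import order matters, here is a minimal sketch that runs without LangChain. It uses a hypothetical stand-in module (`fake_azuresearch`, not a real package) that reads the environment variable into a module-level constant when it is imported, which is the behavior described above. Treat it as a demonstration of the mechanism under that assumption, not as the library's actual code.

```python
import os
import types

# Hypothetical stand-in for a library module that reads its settings at import time.
MODULE_SOURCE = '''
import os
FIELDS_CONTENT_VECTOR = os.environ.get(
    "AZURESEARCH_FIELDS_CONTENT_VECTOR", "content_vector"
)
'''


def import_fake_module() -> types.ModuleType:
    """Execute MODULE_SOURCE the way an import would and return the module."""
    module = types.ModuleType("fake_azuresearch")
    exec(MODULE_SOURCE, module.__dict__)
    return module


# Case 1: the module is imported before the variable is set.
os.environ.pop("AZURESEARCH_FIELDS_CONTENT_VECTOR", None)
imported_too_early = import_fake_module()

# Case 2: the variable is set first, then the module is imported.
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "vector"
imported_after_setting = import_fake_module()

print(imported_too_early.FIELDS_CONTENT_VECTOR)      # content_vector
print(imported_after_setting.FIELDS_CONTENT_VECTOR)  # vector
```

The first module keeps the default even though the variable is set later, because the constant was bound once at import time.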

Also, please note that there is a typo in your AzureSearch constructor. You wrote fiekds=fields, but it should be fields=fields.

If you still encounter issues after making these changes, please provide more information so I can further assist you.

I hope this helps! If you have any other questions, feel free to ask.

Best regards, Dosu

kaito0228 commented 8 months ago

I too am facing this problem. Do you have any plans to resolve it? I just want to have a reference to the VECTOR FIELDS that we specify.

alexmanie commented 7 months ago

Hi @levalencia, are you still having this issue? Are you trying to execute that piece of code inside an Azure Function app?

When you run that piece of code in your solution, what value do you get for the FIELDS_CONTENT_VECTOR variable?

FIELDS_CONTENT_VECTOR = get_from_env(
    key="AZURESEARCH_FIELDS_CONTENT_VECTOR",
    env_key="AZURESEARCH_FIELDS_CONTENT_VECTOR",
    default="content_vector",
)
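
One way to answer that question is to inspect the environment directly. The sketch below is an assumption-laden diagnostic, not part of LangChain: `diagnose_env` and `EXPECTED_KEYS` are hypothetical names introduced here. It reports whether each variable is set and flags names that differ only by surrounding whitespace. That check may be relevant because the key posted in local.settings.json above appears as "AZURESEARCH_FIELDS_CONTENT_VECTOR " with a trailing space.

```python
import os

# Variable names taken from the thread; the list itself is just for this sketch.
EXPECTED_KEYS = [
    "AZURESEARCH_FIELDS_ID",
    "AZURESEARCH_FIELDS_CONTENT",
    "AZURESEARCH_FIELDS_CONTENT_VECTOR",
    "AZURESEARCH_FIELDS_TAG",
]


def diagnose_env(environ, expected_keys=EXPECTED_KEYS):
    """Return {key: status} describing whether each expected variable is set."""
    report = {}
    for key in expected_keys:
        if key in environ:
            report[key] = f"set to {environ[key]!r}"
            continue
        # Look for names that differ only by leading or trailing whitespace.
        near_misses = [
            name for name in environ if name != key and name.strip() == key
        ]
        if near_misses:
            report[key] = f"missing, but found similar name {near_misses[0]!r}"
        else:
            report[key] = "missing, default will be used"
    return report


if __name__ == "__main__":
    for key, status in diagnose_env(os.environ).items():
        print(f"{key}: {status}")
```

Running this at the top of the function, before any LangChain import, shows which names the process actually sees.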

gurvinder-dhillon commented 5 months ago

Did anyone ever find a solution to this? None of the suggested solutions work.

levalencia commented 5 months ago

import os

os.environ["AZURESEARCH_FIELDS_ID"] = "chunk_id"
os.environ["AZURESEARCH_FIELDS_CONTENT"] = "chunk"
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "vector"
os.environ["AZURESEARCH_FIELDS_TAG"] = "metadata"

from langchain.vectorstores import AzureSearch

This worked for me.

Hadi2525 commented 4 months ago

I am also getting the same error. It sounds like the index field names have to come from these environment variables:

os.environ["AZURESEARCH_FIELDS_ID"]
os.environ["AZURESEARCH_FIELDS_CONTENT"]
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"]
os.environ["AZURESEARCH_FIELDS_TAG"]

and there is no other way to apply custom field names that match the index I built on Azure Cognitive Search.

Pulling in @hwchase17, @pamelafox, and @marlenezw for their insights.

marlenezw commented 4 months ago

Thanks for the tag @Hadi2525, let me take a look

ANMOL2001A commented 3 months ago

What if we don't want to set environment variables? Can we pass our own fields instead of the default fields?

Hadi2525 commented 3 months ago

I am still waiting for the relevant contributor to chime in. @marlenezw, if you have any updates on this issue, please let us know. Thank you!

rcruzgar commented 2 months ago

Hi! Any updates on this error? I am trying to run a multi-vector search and I am having the same issue. Do you know if it has been fixed in the latest langchain releases? (I don't think so: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/vectorstores/azuresearch.py). Setting the environment variable before importing the class works for me, but I need to search over several vector fields, not just one. AzureSearch should take a parameter like "vector_field_name" so that vector stores can be constructed dynamically, instead of being restricted to the single field name set beforehand as an environment variable.

Hadi2525 commented 2 months ago

Hey @rcruzgar, yeah, I don't see any developers focusing on this issue. I will try to take a stab at it and put out a PR soon. Stay tuned.

jo-jstrm commented 4 days ago

I am also facing this issue. My current workaround is not to change the environment variables, but to directly update the module-level global variables that are initialized from them:

import langchain_community.vectorstores.azuresearch as azuresearch

azuresearch.FIELDS_CONTENT_VECTOR = 'whatever_your_field_is_named'
# Will use 'whatever_your_field_is_named' for retrieval
search = azuresearch.AzureSearch(...)

azuresearch.FIELDS_CONTENT_VECTOR = 'even_better_name'
# Will use 'even_better_name' for retrieval
search = azuresearch.AzureSearch(...)

This way, you can change the field names at runtime and skip the round trip through environment variables.
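
If the override should only apply for part of the program, it can be wrapped in a context manager that restores the previous value afterwards. The sketch below is one possible way to do that, not an API that LangChain provides: `override_attr` is a hypothetical helper, and `fake_azuresearch` is a stand-in module so the example runs without LangChain installed. It also assumes the global is read when a query runs rather than when the store is constructed, which is worth checking against your installed version.

```python
import contextlib
import types


@contextlib.contextmanager
def override_attr(module, name, value):
    """Temporarily set module.<name> to value and restore the old value on exit."""
    original = getattr(module, name)
    setattr(module, name, value)
    try:
        yield
    finally:
        setattr(module, name, original)


# Stand-in for langchain_community.vectorstores.azuresearch (hypothetical).
fake_azuresearch = types.ModuleType("fake_azuresearch")
fake_azuresearch.FIELDS_CONTENT_VECTOR = "content_vector"


def fake_query():
    """Hypothetical query that reads the module global at call time."""
    return fake_azuresearch.FIELDS_CONTENT_VECTOR


with override_attr(fake_azuresearch, "FIELDS_CONTENT_VECTOR", "vector"):
    inside = fake_query()
outside = fake_query()

print(inside, outside)  # vector content_vector
```

With the real module, the same helper would be called as `override_attr(azuresearch, "FIELDS_CONTENT_VECTOR", "vector")` around the query. Because it mutates a module-level global, this approach is not safe when several threads query with different field names at the same time.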