AzureSearch.py is using constant field names instead of ours

levalencia commented 1 year ago

System Info

langchain 0.0.273

Who can help?

@hwchase17

Information

[ ] The official example notebooks/scripts
[ ] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[ ] Document Loaders
[X] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

I am trying to use the Azure Cognitive Search retriever, but it fails because our field names are different.

Our index looks like this:

Our code:

        llm = AzureChatOpenAI(
            openai_api_base=config['AZURE_OPENAI_ENDPOINT'],
            openai_api_version=config['AZURE_OPENAI_API_VERSION'],
            deployment_name=config['OPENAI_DEPLOYMENT_NAME'],
            openai_api_key=config['AZURE_OPENAI_API_KEY'],
            openai_api_type=config['OPENAI_API_TYPE'],
            model_name=config['OPENAI_MODEL_NAME'],
            temperature=0)

        embeddings = OpenAIEmbeddings(
                            openai_api_base=config['AZURE_OPENAI_ENDPOINT'],
                            openai_api_type="azure",
                            deployment=config['AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME'],
                            openai_api_key=config['AZURE_OPENAI_API_KEY'],
                            model=config['AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME'],
                            chunk_size=1)

        fields = [
            SimpleField(
                name="id",
                type=SearchFieldDataType.String,
                key=True,
                filterable=True,
            ),
            SearchableField(
                name="text",
                type=SearchFieldDataType.String,
                searchable=True,
            ),
            SearchField(
                name="embedding",
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True,
                vector_search_dimensions=1536,
                vector_search_configuration="default",
            )
        ]

        vector_store: AzureSearch = AzureSearch(
            azure_search_endpoint=config['AZURE_SEARCH_SERVICE_ENDPOINT'],
            azure_search_key=config['AZURE_SEARCH_ADMIN_KEY'],
            index_name=config['AZURE_SEARCH_VECTOR_INDEX_NAME'],
            embedding_function=embeddings.embed_query,
            fields=fields,
        )

        retriever = vector_store.as_retriever(search_type="similarity", kwargs={"k": 3})

        # Creating instance of RetrievalQA
        chain = RetrievalQA.from_chain_type(llm=llm,
                                            chain_type="stuff",
                                            retriever=retriever,
                                            return_source_documents=True)

        # Generating response to user's query
        response = chain({"query": config['question']})

I traced it all down to the function vector_search_with_score in azuresearch.py:

        results = self.client.search(
            search_text="",
            vectors=[
                Vector(
                    value=np.array(
                        self.embedding_function(query), dtype=np.float32
                    ).tolist(),
                    k=k,
                    fields=FIELDS_CONTENT_VECTOR,
                )
            ],
            select=[FIELDS_ID, FIELDS_CONTENT, FIELDS_METADATA],
            filter=filters,
        )

The code is trying to use FIELDS_CONTENT_VECTOR, which is a constant, and it's not our field name.

I guess some other issues may arise with other parts of the code where constants are used.

Why do we have different field names? We are using Microsoft examples to set up all the Azure indexes, indexers, skillsets and data sources: https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb and their OpenAI embedding generator, deployed as an Azure Function: https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Vector/EmbeddingGenerator/README.md

I wrote a blog post series about this https://medium.com/python-in-plain-english/elevate-chat-ai-applications-mastering-azure-cognitive-search-with-vector-storage-for-llm-a2082f24f798

Expected behavior

I should be able to define the field names we want to use, but the code uses constants.

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for your detailed report. I understand that you're having issues with the field names in the AzureSearch class in the LangChain framework.

The field names used in the AzureSearch class are not hardcoded but are defined as constants at the top of the file: FIELDS_ID, FIELDS_CONTENT, FIELDS_CONTENT_VECTOR, and FIELDS_METADATA. These constants are set using the get_from_env function, which retrieves the field names from environment variables. If the environment variables are not set, it uses default field names.

To use custom field names, you would need to set the corresponding environment variables to your custom field names. For example, if you want to use the field name "my_custom_id" instead of "id", you would set the AZURESEARCH_FIELDS_ID environment variable to "my_custom_id".

Here's an example of how you might set the environment variables in Python:

import os

os.environ["AZURESEARCH_FIELDS_ID"] = "my_custom_id"
os.environ["AZURESEARCH_FIELDS_CONTENT"] = "my_custom_content"
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "my_custom_content_vector"
os.environ["AZURESEARCH_FIELDS_METADATA"] = "my_custom_metadata"

Please note that this is a simplified example and the actual implementation may need to be more complex, depending on your specific requirements. For example, if you want to support an arbitrary number of custom fields, you might need to modify the method to accept a list of field names and values, rather than using environment variables.
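
For the index in this report, the lookup described above would resolve roughly as in the sketch below. It only mimics the "environment variable, else default" behaviour with a plain dictionary lookup; the helper name `resolve_field_names` is hypothetical, and the `content` default is an assumption (the thread itself only shows the `id`, `content_vector` and `metadata` defaults).

```python
import os

# Default field names. "content" is an assumption; the others appear in this thread.
DEFAULTS = {
    "AZURESEARCH_FIELDS_ID": "id",
    "AZURESEARCH_FIELDS_CONTENT": "content",
    "AZURESEARCH_FIELDS_CONTENT_VECTOR": "content_vector",
    "AZURESEARCH_FIELDS_METADATA": "metadata",
}


def resolve_field_names(environ=os.environ):
    """Return the effective field name per variable: env value if set, else default."""
    return {key: environ.get(key, default) for key, default in DEFAULTS.items()}


# The index in this report uses "text" and "embedding" instead of the defaults.
env = {
    "AZURESEARCH_FIELDS_CONTENT": "text",
    "AZURESEARCH_FIELDS_CONTENT_VECTOR": "embedding",
}
effective = resolve_field_names(env)
print(effective)
```

Variables that are not set (here the id and metadata ones) fall back to their defaults, so only the names that differ from the defaults need to be provided.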

I hope this helps. If you have any further questions, please don't hesitate to ask.

Best regards, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

If you want to continue the conversation, start your reply with @dosu-bot.

NatanMish commented 1 year ago

Hi @levalencia, can you please share what error message you are getting?

levalencia commented 1 year ago

Parameter name: vectorFields Code: Message: Unknown field 'content_vector' in vector field list.

levalencia commented 1 year ago

Full stack trace here:

[2023-08-28T09:07:23.685Z] Error occurred: () Unknown field 'content_vector' in vector field list.
[2023-08-28T09:07:23.685Z] Code:
[2023-08-28T09:07:23.686Z] Parameter name: vectorFields
Code:
Message: Unknown field 'content_vector' in vector field list.
[2023-08-28T09:07:23.687Z] Message: Unknown field 'content_vector' in vector field list.
[2023-08-28T09:07:23.688Z] Parameter name: vectorFields
[2023-08-28T09:07:23.689Z] Parameter name: vectorFields
[2023-08-28T09:07:24.130Z] Executed 'Functions.AskYourDocuments' (Failed, Id=1057e2a2-e276-4fba-b64a-ea0d7f212bfe, Duration=63922ms)
[2023-08-28T09:07:24.131Z] System.Private.CoreLib: Exception while executing function: Functions.AskYourDocuments. System.Private.CoreLib: Result: Failure
Exception: TypeError: exceptions must derive from BaseException
Stack:   File "C:\Program Files\Microsoft\Azure Functions Core Tools\workers\python\3.10/WINDOWS/X64\azure_functions_worker\dispatcher.py", line 479, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "C:\Users\LuisValencia\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Program Files\Microsoft\Azure Functions Core Tools\workers\python\3.10/WINDOWS/X64\azure_functions_worker\dispatcher.py", line 752, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
  File "C:\Program Files\Microsoft\Azure Functions Core Tools\workers\python\3.10/WINDOWS/X64\azure_functions_worker\extension.py", line 215, in _raw_invocation_wrapper
    result = function(**args)
  File "C:\Users\LuisValencia\repos\LiantisCodeCopy\backend\function_app.py", line 259, in AskYourDocuments
    raise func.HttpResponse(f"Error occurred {str(e)}", status_code=500)

NatanMish commented 1 year ago

Hey @levalencia, I added an integration test in this PR (https://github.com/langchain-ai/langchain/pull/9856/files). It passes locally for me on a sample index I have in my Azure Search instance. What is different in your setup? Feel free to commit to the PR; you can use it to find what's wrong. Note that you'll need to set a few environment variables.

levalencia commented 1 year ago

The code you sent works, and it's actually the same code I used.

However I get the error in:


        retriever = vector_store.as_retriever(search_type="similarity", kwargs={"k": 3})

        # Creating instance of RetrievalQA
        chain = RetrievalQA.from_chain_type(llm=llm,
                                            chain_type="stuff",
                                            retriever=retriever,
                                            return_source_documents=True)

        # Generating response to user's query
        response = chain({"query": config['question']})  # <--- the error is raised HERE

NatanMish commented 1 year ago

I have added these lines to the PR, with my instance of OpenAI in Azure, and the test still passes.

reflection commented 1 year ago

@levalencia might seem silly, but can you check: are you setting the AZURESEARCH_FIELDS_CONTENT_VECTOR env var after your app is already initialized (i.e. os.environ['AZURESEARCH_FIELDS_CONTENT_VECTOR'] = "my_vector_field")?

We found that in our FastAPI app we couldn't set this after app start (despite the dosu-beta comment above, lol): by then the langchain lib had already picked up what was in our app's environment.
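
The behaviour described here can be reproduced without langchain. The sketch below is a stand-in, not the library's code: `MODULE_SOURCE` imitates a module that reads the variable at the top of the file, and the hypothetical `fake_import` helper runs it the way a first import would.

```python
import os
import types

# Imitation of a module that reads the env var once, when its body runs.
MODULE_SOURCE = '''
import os
FIELDS_CONTENT_VECTOR = os.environ.get(
    "AZURESEARCH_FIELDS_CONTENT_VECTOR", "content_vector"
)
'''


def fake_import(name):
    """Execute MODULE_SOURCE as a fresh module, like a first-time import."""
    module = types.ModuleType(name)
    exec(MODULE_SOURCE, module.__dict__)
    return module


# Start from a clean state so the default applies.
os.environ.pop("AZURESEARCH_FIELDS_CONTENT_VECTOR", None)

early = fake_import("early")  # "imported" before the variable is set
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "embedding"
late = fake_import("late")    # "imported" after the variable is set

print(early.FIELDS_CONTENT_VECTOR)  # content_vector (the default was captured)
print(late.FIELDS_CONTENT_VECTOR)   # embedding
```

Changing `os.environ` after the first module has run does not update the constant it already captured, which is why the variable has to be in the environment before the import happens.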

levalencia commented 1 year ago

OK, part of the issue is resolved:

I was setting the variables in the .env instead of the local.settings.json.

After fixing this I get this error instead:

[2023-08-29T06:46:37.990Z] ERROR:e61:Error occurred: You need to specify at least the following fields {'metadata': 'Edm.String'} or provide alternative field names in the env variables.
[2023-08-29T06:46:37.991Z] metadata current type: 'MISSING'. It has to be 'Edm.String' or you can point to a different 'Edm.String' field name by using the env variable 'AZURESEARCH_FIELDS_METADATA'
[2023-08-29T06:46:38.020Z] Executed 'Functions.AskYourDocuments' (Failed, Id=34e80755-5624-4bc6-9c32-6bfeda7f2f4a, Duration=35645ms)
[2023-08-29T06:46:38.022Z] System.Private.CoreLib: Exception while executing function: Functions.AskYourDocuments. System.Private.CoreLib: Result: Failure
Exception: TypeError: exceptions must derive from BaseException

Please note I don't have any metadata field on my Azure Search index, so I didn't set that value in local.settings.json. The index looks like this:


{
  "@odata.context": "https://xx.search.windows.net/indexes('e61-chunk-index')/$metadata#docs(*)",
  "value": [
    {
      "@search.score": 1,
      "id": "aHR0cHM6Ly9zcWxrbm93bGVkZ2VzdG9yZS5ibG9iLmNvcmUud2luZG93cy5uZXQvZTYxY2h1bmtpbmRleC8yL2NvbnRlbnRfY2h1bmtzXzAuanNvbg2",
      "source_document_id": null,
      "source_document_filepath": null,
      "source_field_name": "content",
      "title": null,
      "index": 0,
      "offset": 0,
      "length": 702,
      "hash": null,
      "text": "Start of Profile Information: First Name: Jane, Last Name: Smith, Date Of Birth: 1985-09-20, Place Of Birth: Seattle, Country Of Birth: USA, Interest: Art, About me: Expressing myself through vari, Occupation: Artist, Education: BFA Fine Arts, Email: jane.smith@email.com, Phone number: 987-654-3210, Website: http://www.janesmithart.com, Social Media: @janesmith, Hobbies: Painting, Sculpting, Photography, Languages Spoken: English, Relationship status: Married, Profile Picture: http://www.janesmithart.com/profile.jpg, Favorite Movie: Amélie, Favorite Book: To Kill a Mockingbird, Favorite Cuisine: French, Likes: Nature, Creativity, Music, Dislikes: Spiders, Spicy food, End of Profile information",
      "embedding": [
        -0.0243478082,
        0.009573406,
        -0.02569905,
        -0.02538036,

I understand the error, but what if I only want to do vector search and not hybrid search?
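
The message reads like the result of comparing the declared fields against the field names the store expects. A simplified sketch of that kind of check follows; it is not the library's actual code, the helper name `missing_fields` is hypothetical, and the `content` default is an assumption.

```python
# Field name -> Edm type the store expects ("content" is an assumed default).
REQUIRED = {
    "id": "Edm.String",
    "content": "Edm.String",
    "content_vector": "Collection(Edm.Single)",
    "metadata": "Edm.String",
}


def missing_fields(index_fields, required=REQUIRED):
    """Return required fields that are absent from the index or have another type."""
    return {
        name: edm_type
        for name, edm_type in required.items()
        if index_fields.get(name) != edm_type
    }


# Content and vector names already remapped to this report's index ("text", "embedding").
required = dict(REQUIRED)
required["text"] = required.pop("content")
required["embedding"] = required.pop("content_vector")

# Fields declared for the index in this report: there is no metadata field.
index_fields = {
    "id": "Edm.String",
    "text": "Edm.String",
    "embedding": "Collection(Edm.Single)",
}

print(missing_fields(index_fields, required))  # {'metadata': 'Edm.String'}
```

Note that the vector search call quoted earlier in this thread also selects the metadata field, so the requirement is not specific to hybrid search.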

reflection commented 1 year ago

@levalencia your use case matches ours (no specific metadata field indexed, metadata stored in separate fields), so feel free to try out my fork:

# Replace langchain==<version> with fork
git+https://github.com/reflection/langchain.git@expand-azure-search-results#egg=langchain&subdirectory=libs/langchain

If my PR makes sense for y'all, please comment on the PR, thanks: https://github.com/langchain-ai/langchain/pull/9894

levalencia commented 1 year ago

I hope it gets merged

dosubot[bot] commented 11 months ago

Hi, @levalencia,

I'm helping the LangChain team manage our backlog and am marking this issue as stale. From what I understand, the issue was raised regarding the AzureSearch.py code using constant field names instead of the ones defined by the user, causing a failure when using Azure Cognitive Search retriever with different field names. I provided a detailed response on how to use custom field names by setting corresponding environment variables, and NatanMish added an integration test in a PR. Although you encountered errors in your specific use case, reflection suggested trying out their fork to address the issue.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you!

JamesLarkinWhite commented 6 months ago

I have no clue how I found this, as I was searching for something completely different... The bug seems to be in desperate need of a resolution, though.

I am not sure I understand the problem correctly, but if there is no direct access available as of now, in the meantime, can't you simply set the env variables BEFORE you import any langchain module (more precisely, azuresearch.py)?

import os

from dotenv import find_dotenv, load_dotenv

# Load environment variables
load_dotenv(find_dotenv())

# Set your variables in the .env file

# or do it in the script

os.environ["AZURESEARCH_FIELDS_ID"] = "my_custom_id"
os.environ["AZURESEARCH_FIELDS_CONTENT"] = "my_custom_content"
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "my_custom_content_vector"
os.environ["AZURESEARCH_FIELDS_METADATA"] = "my_custom_metadata"

....
from langchain_community.vectorstores.azuresearch import AzureSearch

This is because azuresearch.py defines global variables at the top of the script, and they are evaluated when the module is first imported:

....
FIELDS_CONTENT_VECTOR = get_from_env(
    key="AZURESEARCH_FIELDS_CONTENT_VECTOR",
    env_key="AZURESEARCH_FIELDS_CONTENT_VECTOR",
    default="content_vector",
)

Or is this not true?

JamesLarkinWhite commented 4 months ago

@dosubot

I still do not understand why it is impossible to set FIELDS_CONTENT_VECTOR from code without using environment variables.

Let's assume we have an Azure Search client. With the default Azure SDK we can, for example, use a vectorized query and set the fields that are searched (given that they are vector fields):

from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint=azure_openai_endpoint,
                    api_version=openai_api_version,
                    api_key=openai_api_key)

# Azure Vector Search
query = "My Question"

embedding = client.embeddings.create(input=query, model='embedding').data[0].embedding
vector_query = VectorizedQuery(vector=embedding,
                               k_nearest_neighbors=1,
                               fields="field1, field2, field3")

# search_client is assumed to be an existing azure.search.documents.SearchClient
results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["id", "author", "body"],
)

Would it be possible to register a custom retriever by inheriting from BaseRetriever, so that we can use it in chains like a normal retriever?

It would of course be easier if the langchain implementation took care of this use case.
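
As a rough illustration of the idea, field names can be carried per instance instead of being read from module-level constants. The sketch below is dependency-free and entirely hypothetical: `FieldNames`, `SimpleVectorRetriever` and `fake_search` are made-up names, and the injected `search` callable stands in for a real search client.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class FieldNames:
    """Per-instance field names, passed in code rather than read from env vars."""
    id: str = "id"
    content: str = "content"
    content_vector: str = "content_vector"


@dataclass
class SimpleVectorRetriever:
    """Hypothetical retriever-like wrapper; `embed` and `search` are injected."""
    embed: Callable[[str], List[float]]
    search: Callable[..., Iterable[Dict[str, Any]]]
    fields: FieldNames
    k: int = 3

    def retrieve(self, query: str) -> List[Dict[str, Any]]:
        # The field names come from this instance, not from globals.
        hits = self.search(
            vector=self.embed(query),
            k=self.k,
            vector_fields=self.fields.content_vector,
            select=[self.fields.id, self.fields.content],
        )
        return [
            {"id": hit[self.fields.id], "page_content": hit[self.fields.content]}
            for hit in hits
        ]


def fake_search(vector, k, vector_fields, select):
    """Stand-in for a search client: returns one canned row, limited to `select`."""
    rows = [{"id": "1", "text": "Jane Smith, Artist", "embedding": vector}]
    return [{name: row[name] for name in select} for row in rows[:k]]


retriever = SimpleVectorRetriever(
    embed=lambda text: [0.0, 1.0],
    search=fake_search,
    fields=FieldNames(content="text", content_vector="embedding"),
)
docs = retriever.retrieve("Who is Jane?")
print(docs)  # [{'id': '1', 'page_content': 'Jane Smith, Artist'}]
```

To be usable in chains, such a class would still have to subclass LangChain's BaseRetriever, implement `_get_relevant_documents`, and return Document objects; that part is left out here to keep the sketch free of dependencies.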
