Closed levalencia closed 11 months ago
Hello,
Thank you for your detailed report. I understand that you're having issues with the field names in the AzureSearch class in the LangChain framework.
The field names used in the AzureSearch class are not hardcoded but are defined as constants at the top of the file: FIELDS_ID, FIELDS_CONTENT, FIELDS_CONTENT_VECTOR, and FIELDS_METADATA. These constants are set using the get_from_env function, which retrieves the field names from environment variables. If the environment variables are not set, it uses default field names.
To use custom field names, you would need to set the corresponding environment variables to your custom field names. For example, if you want to use the field name "my_custom_id" instead of "id", you would set the AZURESEARCH_FIELDS_ID environment variable to "my_custom_id".
Here's an example of how you might set the environment variables in Python:
import os
os.environ["AZURESEARCH_FIELDS_ID"] = "my_custom_id"
os.environ["AZURESEARCH_FIELDS_CONTENT"] = "my_custom_content"
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "my_custom_content_vector"
os.environ["AZURESEARCH_FIELDS_METADATA"] = "my_custom_metadata"
Please note that this is a simplified example and the actual implementation may need to be more complex, depending on your specific requirements. For example, if you want to support an arbitrary number of custom fields, you might need to modify the method to accept a list of field names and values, rather than using environment variables.
I hope this helps. If you have any further questions, please don't hesitate to ask.
Best regards, Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and a 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Hi @levalencia, can you please share what error message you are getting?
Parameter name: vectorFields Code: Message: Unknown field 'content_vector' in vector field list.
full stack trace here
[2023-08-28T09:07:23.685Z] Error occurred: () Unknown field 'content_vector' in vector field list.
[2023-08-28T09:07:23.685Z] Code:
[2023-08-28T09:07:23.686Z] Parameter name: vectorFields
Code:
Message: Unknown field 'content_vector' in vector field list.
[2023-08-28T09:07:23.687Z] Message: Unknown field 'content_vector' in vector field list.
[2023-08-28T09:07:23.688Z] Parameter name: vectorFields
[2023-08-28T09:07:23.689Z] Parameter name: vectorFields
[2023-08-28T09:07:24.130Z] Executed 'Functions.AskYourDocuments' (Failed, Id=1057e2a2-e276-4fba-b64a-ea0d7f212bfe, Duration=63922ms)
[2023-08-28T09:07:24.131Z] System.Private.CoreLib: Exception while executing function: Functions.AskYourDocuments. System.Private.CoreLib: Result: Failure
Exception: TypeError: exceptions must derive from BaseException
Stack: File "C:\Program Files\Microsoft\Azure Functions Core Tools\workers\python\3.10/WINDOWS/X64\azure_functions_worker\dispatcher.py", line 479, in _handle__invocation_request
call_result = await self._loop.run_in_executor(
File "C:\Users\LuisValencia\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "C:\Program Files\Microsoft\Azure Functions Core Tools\workers\python\3.10/WINDOWS/X64\azure_functions_worker\dispatcher.py", line 752, in _run_sync_func
return ExtensionManager.get_sync_invocation_wrapper(context,
File "C:\Program Files\Microsoft\Azure Functions Core Tools\workers\python\3.10/WINDOWS/X64\azure_functions_worker\extension.py", line 215, in _raw_invocation_wrapper
result = function(**args)
File "C:\Users\LuisValencia\repos\LiantisCodeCopy\backend\function_app.py", line 259, in AskYourDocuments
raise func.HttpResponse(f"Error occurred {str(e)}", status_code=500)
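A note on the final "TypeError: exceptions must derive from BaseException" in the trace above: Python only accepts exception classes or instances in a raise statement, so raising a response object replaces the original error with a TypeError. The following is a minimal sketch of the difference between raising and returning the response. The HttpResponse class here is a hypothetical stand-in written for this example, not the real azure.functions type, and the handler names are made up.

```python
class HttpResponse:
    """Hypothetical stand-in for an HTTP response type (not azure.functions)."""

    def __init__(self, body: str, status_code: int = 200) -> None:
        self.body = body
        self.status_code = status_code


def handler_that_raises() -> HttpResponse:
    try:
        raise ValueError("Unknown field 'content_vector' in vector field list.")
    except Exception as e:
        # Raising a non-exception object fails with TypeError and masks the real error.
        raise HttpResponse(f"Error occurred {e}", status_code=500)


def handler_that_returns() -> HttpResponse:
    try:
        raise ValueError("Unknown field 'content_vector' in vector field list.")
    except Exception as e:
        # Returning the response hands the 500 and the message back to the caller.
        return HttpResponse(f"Error occurred {e}", status_code=500)


try:
    handler_that_raises()
except TypeError as err:
    raised_message = str(err)

response = handler_that_returns()
```

With the return variant, the caller receives a 500 response that still carries the underlying search error message.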
Hey @levalencia , I added an integration test in this PR: https://github.com/langchain-ai/langchain/pull/9856/files It passes locally for me on a sample index I have in my Azure search instance. What is different in your setup? feel free to commit to the PR, you can use it to find what's wrong. Notice you'll need to set a few environment variables.
The code you sent works, and it's actually the same code I used.
However I get the error in:
retriever = vector_store.as_retriever(search_type="similarity", kwargs={"k": 3})
# Creating instance of RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=retriever,
                                    return_source_documents=True)
# Generating response to user's query
response = chain({"query": config['question']})  # <--- the error is raised HERE
I have added these lines to the PR, with my instance of OpenAI in Azure, and the test still passes.
@levalencia might seem silly, but can you check:
Are you setting the AZURESEARCH_FIELDS_CONTENT_VECTOR env var after your app is already initialized (i.e. os.environ['AZURESEARCH_FIELDS_CONTENT_VECTOR'] = "my_vector_field")?
We found that in our FastAPI app we couldn't set this after app start (despite the dosu-beta comment above lol); the langchain lib would already have picked up what was in our app's environment.
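The behavior described above follows from module-level code running once, at import time. Here is a minimal sketch of that effect using only the standard library. The load_module_constants function is a hypothetical stand-in for the top of a module being executed on import; it is not LangChain code.

```python
import os

ENV_KEY = "AZURESEARCH_FIELDS_CONTENT_VECTOR"


def load_module_constants() -> dict:
    """Simulates a module's top-level code, which runs once at import time."""
    return {"FIELDS_CONTENT_VECTOR": os.environ.get(ENV_KEY, "content_vector")}


# Start from a clean environment so the default applies.
os.environ.pop(ENV_KEY, None)

# The "import" happens first, so the default value is captured.
constants = load_module_constants()

# Setting the variable afterwards does not change the captured value.
os.environ[ENV_KEY] = "my_vector_field"
late_value = constants["FIELDS_CONTENT_VECTOR"]

# Setting the variable before the "import" is what takes effect.
constants = load_module_constants()
early_value = constants["FIELDS_CONTENT_VECTOR"]
```

In a real app the equivalent fix is to have the variable in the process environment before the first import of the module that reads it.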
OK part of the issue is resolved:
After fixing this I get this error instead:
[2023-08-29T06:46:37.990Z] ERROR:e61:Error occurred: You need to specify at least the following fields {'metadata': 'Edm.String'} or provide alternative field names in the env variables.
[2023-08-29T06:46:37.991Z] metadata current type: 'MISSING'. It has to be 'Edm.String' or you can point to a different 'Edm.String' field name by using the env variable 'AZURESEARCH_FIELDS_METADATA'
[2023-08-29T06:46:38.020Z] Executed 'Functions.AskYourDocuments' (Failed, Id=34e80755-5624-4bc6-9c32-6bfeda7f2f4a, Duration=35645ms)
[2023-08-29T06:46:38.022Z] System.Private.CoreLib: Exception while executing function: Functions.AskYourDocuments. System.Private.CoreLib: Result: Failure
Exception: TypeError: exceptions must derive from BaseException
Please note I don't have any metadata on my Azure Search index, so I didn't set that value in local.settings.json. The index looks like this:
{
"@odata.context": "https://xx.search.windows.net/indexes('e61-chunk-index')/$metadata#docs(*)",
"value": [
{
"@search.score": 1,
"id": "aHR0cHM6Ly9zcWxrbm93bGVkZ2VzdG9yZS5ibG9iLmNvcmUud2luZG93cy5uZXQvZTYxY2h1bmtpbmRleC8yL2NvbnRlbnRfY2h1bmtzXzAuanNvbg2",
"source_document_id": null,
"source_document_filepath": null,
"source_field_name": "content",
"title": null,
"index": 0,
"offset": 0,
"length": 702,
"hash": null,
"text": "Start of Profile Information: First Name: Jane, Last Name: Smith, Date Of Birth: 1985-09-20, Place Of Birth: Seattle, Country Of Birth: USA, Interest: Art, About me: Expressing myself through vari, Occupation: Artist, Education: BFA Fine Arts, Email: jane.smith@email.com, Phone number: 987-654-3210, Website: http://www.janesmithart.com, Social Media: @janesmith, Hobbies: Painting, Sculpting, Photography, Languages Spoken: English, Relationship status: Married, Profile Picture: http://www.janesmithart.com/profile.jpg, Favorite Movie: Amélie, Favorite Book: To Kill a Mockingbird, Favorite Cuisine: French, Likes: Nature, Creativity, Music, Dislikes: Spiders, Spicy food, End of Profile information",
"embedding": [
-0.0243478082,
0.009573406,
-0.02569905,
-0.02538036,
I understand the error, but what if I only want to do vector search and not hybrid search?
@levalencia your use case matches ours (no specific metadata field indexed, metadata stored in separate fields), so feel free to try out my fork:
# Replace langchain==<version> with fork
git+https://github.com/reflection/langchain.git@expand-azure-search-results#egg=langchain&subdirectory=libs/langchain
If my PR makes sense for y'all, please comment on the PR, thanks: https://github.com/langchain-ai/langchain/pull/9894
I hope it gets merged
Hi, @levalencia,
I'm helping the LangChain team manage our backlog and am marking this issue as stale. From what I understand, the issue was raised regarding the azuresearch.py code using constant field names instead of the ones defined by the user, causing a failure when using the Azure Cognitive Search retriever with different field names. I provided a detailed response on how to use custom field names by setting the corresponding environment variables, and NatanMish added an integration test in a PR. Although you encountered errors in your specific use case, reflection suggested trying out their fork to address the issue.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you!
I have no clue how I found this, as I was searching for something completely different... The bug seems to be in desperate need of a resolution, though.
I am not sure if I understand the problem correctly, but if there is no direct access available as of now, in the meantime you can simply set the env variable BEFORE you load any langchain module, more precisely azuresearch.py?
import os
from dotenv import find_dotenv, load_dotenv
# Load environment variables
load_dotenv(find_dotenv())
# Set your variables in the .env file
# or set them in the script
os.environ["AZURESEARCH_FIELDS_ID"] = "my_custom_id"
os.environ["AZURESEARCH_FIELDS_CONTENT"] = "my_custom_content"
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "my_custom_content_vector"
os.environ["AZURESEARCH_FIELDS_METADATA"] = "my_custom_metadata"
....
from langchain_community.vectorstores.azuresearch import AzureSearch
This is because azuresearch.py defines a global variable at the top of the script, which is read once when the module is imported:
....
FIELDS_CONTENT_VECTOR = get_from_env(
    key="AZURESEARCH_FIELDS_CONTENT_VECTOR",
    env_key="AZURESEARCH_FIELDS_CONTENT_VECTOR",
    default="content_vector",
)
Or is this not true?
@dosubot
I still do not understand why it is impossible to set FIELDS_CONTENT_VECTOR from code without using environment variables.
Let's assume we have an Azure Search client. With the default Azure SDK we can, for example, use a vectorized query and set the fields that are searched (given that they are vector fields):
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI
client = AzureOpenAI(azure_endpoint=azure_openai_endpoint,
                     api_version=openai_api_version,
                     api_key=openai_api_key)
# Azure Vector Search
query = "My Question"
embedding = client.embeddings.create(input=query, model='embedding').data[0].embedding
vector_query = VectorizedQuery(vector=embedding,
                               k_nearest_neighbors=1,
                               fields="field1, field2, field3")
results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["id", "author", "body"],
)
Would it be possible to register a custom retriever by inheriting from BaseRetriever, so that we can use it in chains like a normal retriever?
Of course, it would be easier if the langchain implementation took care of this use case.
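On the question above of setting FIELDS_CONTENT_VECTOR from code: in plain Python, a module-level constant can be reassigned after import, and functions that read it by name will see the new value. The sketch below shows this with a hypothetical stand-in module built for the example (fake_azuresearch); it is not LangChain code. Whether the same trick works against the real azuresearch.py is an assumption that depends on how the constant is used there. Any value already bound as a default argument or copied elsewhere at import time would not change.

```python
import types

# Hypothetical stand-in for a module that defines a field-name constant at import time.
fake_azuresearch = types.ModuleType("fake_azuresearch")
exec(
    "FIELDS_CONTENT_VECTOR = 'content_vector'\n"
    "def vector_field():\n"
    "    # Looked up in the module globals on every call.\n"
    "    return FIELDS_CONTENT_VECTOR\n"
    "def vector_field_default(name=FIELDS_CONTENT_VECTOR):\n"
    "    # The default was bound once, when the function was defined.\n"
    "    return name\n",
    fake_azuresearch.__dict__,
)

before = fake_azuresearch.vector_field()

# Reassign the constant on the module object after "import".
fake_azuresearch.FIELDS_CONTENT_VECTOR = "embedding"

after = fake_azuresearch.vector_field()
after_default = fake_azuresearch.vector_field_default()
```

The function that reads the global by name picks up "embedding", while the one with the bound default still returns "content_vector", which is why this kind of patching is fragile compared with a proper constructor parameter.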
System Info
langchain 0.0.273
Who can help?
@hwchase17
Information
Related Components
Reproduction
I am trying to use Azure Cognitive Search retriever, however it fails because our fields are different:
Our index looks like this:
Our code:
I traced it all down to the function: vector_search_with_score in azuresearch.py
The code is trying to use FIELDS_CONTENT_VECTOR, which is a constant, and it's not our field name.
I guess some other issues may arise with other parts of the code where constants are used.
Why do we have different field names? We are using Microsoft examples to setup all azure indexing, indexers, skillsets and datasources: https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb and their open ai embedding generator deployed as an azure function: https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Vector/EmbeddingGenerator/README.md
I wrote a blog post series about this https://medium.com/python-in-plain-english/elevate-chat-ai-applications-mastering-azure-cognitive-search-with-vector-storage-for-llm-a2082f24f798
Expected behavior
I should be able to define the fields we want to use, but the code uses constants