agazzeri opened this issue 10 months ago
Hi @agazzeri you may be getting an error from the attempt to use AAD auth to get embeddings. You can switch this back to key-based auth by modifying the get_embedding() function in data_utils.py like so:
```python
def get_embedding(text, embedding_model_endpoint=None, embedding_model_key=None):
    endpoint = embedding_model_endpoint if embedding_model_endpoint else os.environ.get("EMBEDDING_MODEL_ENDPOINT")
    key = embedding_model_key if embedding_model_key else os.environ.get("EMBEDDING_MODEL_KEY")

    if endpoint is None or key is None:
        raise Exception("EMBEDDING_MODEL_ENDPOINT and EMBEDDING_MODEL_KEY are required for embedding")

    try:
        endpoint_parts = endpoint.split("/openai/deployments/")
        base_url = endpoint_parts[0]
        deployment_id = endpoint_parts[1].split("/embeddings")[0]

        openai.api_version = '2023-05-15'
        openai.api_base = base_url
        openai.api_type = 'azure'
        openai.api_key = key

        embeddings = openai.Embedding.create(deployment_id=deployment_id, input=text)
        return embeddings['data'][0]["embedding"]
    except Exception as e:
        raise Exception(f"Error getting embeddings with endpoint={endpoint} with error={e}")
```
You may need to set the EMBEDDING_MODEL_ENDPOINT and EMBEDDING_MODEL_KEY explicitly similarly to the Form Recognizer settings.
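For reference, here is a standalone sketch (with a made-up resource name) of how the parsing logic above splits the endpoint URL into a base URL and deployment ID; it also shows the endpoint format the script expects:

```python
# Hypothetical endpoint value, in the format data_utils.py expects.
endpoint = "https://myresource.openai.azure.com/openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15"

# Same parsing logic as in get_embedding().
endpoint_parts = endpoint.split("/openai/deployments/")
base_url = endpoint_parts[0]
deployment_id = endpoint_parts[1].split("/embeddings")[0]

print(base_url)       # https://myresource.openai.azure.com
print(deployment_id)  # text-embedding-ada-002
```

If the endpoint is not in this shape (for example, the model name instead of the deployment name, or a missing `/embeddings` segment), the split produces a wrong base URL or deployment ID.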
@agazzeri Did this solution work? I'm having the same problem because I can't create a vectorized index using PDF files.
@edercarlima No it didn't work, but honestly I didn't have too much time to troubleshoot further.
@sarah-widder
As mentioned by @agazzeri, the suggestion does not solve the problem. I'm having the same issue: I can't create a vectorized index from a .pdf file. Are any fixes expected to be made available?
Note: I was able to create a vectorized index from a .txt file.
Same problem here with PDFs. Tried several different ones with the same result:
```
(env) ➜ sample-app-aoai-chatGPT git:(main) ✗ python scripts/data_preparation.py --config scripts/config.json
Data preparation script started
Preparing data for index: test1000
Using existing search service search-test-1000
Updated existing search index test1000
Chunking path /home/mikel.brostrom/sample-app-aoai-chatGPT/data...
Total files to process=1 out of total directory size=1
Multiprocessing with njobs=4
  0%|          | 0/1 [00:00<?, ?it/s]SingletonFormRecognizerClient: Creating instance of Form recognizer per process
SingletonFormRecognizerClient: Skipping since credentials not provided. Assuming NO form recognizer extensions(like .pdf) in directory
'object' object has no attribute 'begin_analyze_document'
File (/home/mikel.brostrom/sample-app-aoai-chatGPT/data/2206.14651.pdf) failed with 'object' object has no attribute 'begin_analyze_document'
100%|██████████| 1/1 [00:00<00:00, 394.35it/s]
Traceback (most recent call last):
  File "/home/mikel.brostrom/sample-app-aoai-chatGPT/scripts/data_preparation.py", line 449, in <module>
    create_index(index_config, credential, form_recognizer_client, embedding_model_endpoint=args.embedding_model_endpoint, use_layout=args.form_rec_use_layout, njobs=args.njobs)
  File "/home/mikel.brostrom/sample-app-aoai-chatGPT/scripts/data_preparation.py", line 396, in create_index
    raise Exception("No chunks found. Please check the data path and chunk size.")
Exception: No chunks found. Please check the data path and chunk size
```
Any ideas @sarah-widder?
> Note: I was able to create vectorized index using txt file

txt worked for me as well
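Given the `Skipping since credentials not provided` line in the log above, it may help to verify up front that the relevant environment variables are set. A small hypothetical helper (not part of the repo):

```python
import os

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# PDF ingestion needs Form Recognizer credentials; embedding needs the
# model endpoint and key (variable names as used in this thread).
required = [
    "FORM_RECOGNIZER_ENDPOINT", "FORM_RECOGNIZER_KEY",
    "EMBEDDING_MODEL_ENDPOINT", "EMBEDDING_MODEL_KEY",
]
# With a toy env that only sets one variable, the other three are reported missing.
print(missing_env_vars(required, env={"FORM_RECOGNIZER_KEY": "abc"}))
```

When any Form Recognizer variable is missing, the script silently skips .pdf files, which later surfaces as the "No chunks found" exception.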
I'm having the same issue; it's pretty unclear what's happening. I'll have to dig in for a few hours and see what's going on.
Found the issue; there were two problems on my side. First, the Bicep changes I made to automate the config file creation used the ada model name instead of the deployment name when generating the endpoint, which resulted in a wrong endpoint, and this code happily ignores that and sleeps without printing anything. Second, I was hitting the quota limit, which is probably the only failure the original author expected to happen!
So I'll create a merge request later for adding:
```python
if add_embeddings:
    for _ in range(RETRY_COUNT):
        try:
            doc.contentVector = get_embedding(chunk, azure_credential=azure_credential, embedding_model_endpoint=embedding_endpoint)
            # or, with key-based auth:
            # doc.contentVector = get_embedding_by_key(chunk, embedding_model_endpoint=embedding_endpoint)
            break
        except Exception as e:
            print(f"\nError getting embedding for chunk={chunk} with error={e}")
            # TODO: Detect which errors are retryable
            time.sleep(30)
    if doc.contentVector is None:
        raise Exception(f"Error getting embedding for chunk={chunk}")
```
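For illustration, the retry idea above can be sketched as a standalone wrapper (hypothetical names, with a stub standing in for the real embedding call):

```python
import time

RETRY_COUNT = 3

def get_embedding_with_retry(chunk, embed_fn, retries=RETRY_COUNT, delay=0):
    """Call embed_fn(chunk) up to `retries` times; return None if all attempts fail."""
    for _ in range(retries):
        try:
            return embed_fn(chunk)
        except Exception as e:
            print(f"Error getting embedding for chunk={chunk!r} with error={e}")
            time.sleep(delay)
    return None

# Stub that fails twice, then succeeds, to exercise the retry path.
calls = {"n": 0}
def flaky_embed(chunk):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return [0.1, 0.2, 0.3]

vector = get_embedding_with_retry("hello", flaky_embed)
print(vector)  # [0.1, 0.2, 0.3]
```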
So this error:

```
Exception: No chunks found. Please check the data path and chunk size
```

could be due to no quota available?
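On the "detect which errors are retryable" TODO mentioned above, one crude but workable sketch is to match on the error message; the marker strings here are assumptions, not an exhaustive list:

```python
# Heuristic markers for transient failures (rate limits, timeouts).
RETRYABLE_MARKERS = ("429", "too many requests", "timeout", "temporarily unavailable")

def is_retryable(error: Exception) -> bool:
    """Retry on rate limits and transient failures, not on auth/config errors."""
    msg = str(error).lower()
    return any(marker in msg for marker in RETRYABLE_MARKERS)

print(is_retryable(RuntimeError("429 Too Many Requests")))  # True
print(is_retryable(RuntimeError("401 Unauthorized")))       # False
```

With a check like this, a quota (429) error would be retried, while an authentication error would fail fast instead of being silently swallowed.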
As @sarah-widder said, it was an issue related to AAD authentication. The `get_embedding()` function can be modified and used as follows:

```python
    if azure_credential is None and (endpoint is None or key is None):
        raise Exception("EMBEDDING_MODEL_ENDPOINT and EMBEDDING_MODEL_KEY are required for embedding")

    try:
        endpoint_parts = endpoint.split("/openai/deployments/")
        base_url = endpoint_parts[0]
        deployment_id = endpoint_parts[1].split("/embeddings")[0]

        openai.api_version = '2023-05-15'
        openai.api_base = base_url
        # if azure_credential is not None:
        #     openai.api_key = azure_credential.get_token("https://cognitiveservices.azure.com/.default").token
        #     openai.api_type = "azure_ad"
        # else:
        openai.api_type = 'azure'
        openai.api_key = "<your key>"  # insert your key

        embeddings = openai.Embedding.create(deployment_id=deployment_id, input=text)
        return embeddings['data'][0]["embedding"]
    except Exception as e:
        raise Exception(f"Error getting embeddings with endpoint={endpoint} with error={e}")
```
I am still experiencing this error even after all the changes suggested above. I also tried hard-coding the endpoint and key in data_utils.py:
```
  0%|          | 0/1 [02:30<?, ?it/s]
Traceback (most recent call last):
  File "/Project/sample-app-aoai-chatGPT/./scripts/prepdocs.py", line 233, in <module>
    create_and_populate_index(
  File "/Project/sample-app-aoai-chatGPT/./scripts/prepdocs.py", line 137, in create_and_populate_index
    result = chunk_directory(
             ^^^^^^^^^^^^^^^^
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 980, in chunk_directory
    result, is_error = process_file(file_path=file_path, directory_path=directory_path, ignore_errors=ignore_errors,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 865, in process_file
    result = chunk_file(
             ^^^^^^^^^^^
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 821, in chunk_file
    return chunk_content(
           ^^^^^^^^^^^^^^
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 767, in chunk_content
    raise e
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 742, in chunk_content
    raise Exception(f"Error getting embedding for chunk={chunk}")
```
I had the same issue and, as @sarah-widder says, it is an issue with authentication via the Azure CLI. In my case I had not added the "Cognitive Services OpenAI User" role on the Azure OpenAI resource I was trying to use; I added it and ensured that the current user was assigned to that role.
So I have the Contributor role for my Azure OpenAI resource. Will it work with the Azure CLI, or do I need to change it to Owner or User Access Administrator? I am getting the same error as shown above.
> I had the same issue and as @sarah-widder says it is an issue with the authentication using the AzureCLI. In my case I had not added the role of "Cognitive Services OpenAI User" to the Azure OpenAI resource I was trying to use and then ensured that the current user was assigned to that role.

That is an RBAC role. I have Contributor access on the Azure OpenAI resource.
Here is how it worked for me:
Set the Azure OpenAI key as an environment variable: `export EMBEDDING_MODEL_KEY=YOUR_AOAI_ENVKEY`. Then change the code for `get_embedding()` in `data_utils.py` to use the `api_key` instead of the token (don't use `azure_ad_token` when creating the `AzureOpenAI` client):
```python
    # if azure_credential is not None:
    #     api_key = azure_credential.get_token("https://cognitiveservices.azure.com/.default").token
    # else:
    api_key = key
    client = AzureOpenAI(api_version=api_version, azure_endpoint=base_url, api_key=api_key)
    embeddings = client.embeddings.create(model=deployment_id, input=text)
    return embeddings.dict()['data'][0]['embedding']
```
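The last line indexes into the response dict; here is a tiny standalone illustration with a fabricated response, since a real call needs network access and credentials:

```python
# Shape of the dict returned by embeddings.dict() in the openai v1 SDK
# (values fabricated for illustration).
response = {
    "data": [{"embedding": [0.01, -0.02, 0.03], "index": 0, "object": "embedding"}],
    "model": "text-embedding-ada-002",
    "object": "list",
}
vector = response["data"][0]["embedding"]
print(vector)  # [0.01, -0.02, 0.03]
```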
Pass the embedding model endpoint (`EMBEDDING_MODEL_ENDPOINT`) on the command line as well:

```
python data_preparation.py --config config.json --njobs=1 --embedding-model-endpoint "https://<oai_env_name>.openai.azure.com/openai/deployments/text-embedding-ada-002/embeddings?api-version=2024-02-01"
```
Here is the diff for the code change:
I'm trying to use the data_preparation.py script to ingest PDF files into a search index for use with OpenAI. I want to include PDF analysis with Document Intelligence (Form Recognizer), plus chunking and vectorization, so my config.json file includes "vector_config_name": "default".
I run the script with the following command:
python3 data_preparation.py --config config.json --embedding-model-endpoint "https://xxxxx.openai.azure.com/openai/deployments/TextEmbedding/embeddings?api-version=2023-06-01-preview" --form-rec-resource "myformrecname" --form-rec-key myformreckey --njobs=4
I have 4 pdf files in the data_path folder and I'm running in a WSL with Ubuntu distro. The script fails with the following (I've replaced my details with ***):
Update: I was able to get one step further by creating environment variables FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY with the same values passed as inputs to the script (for some reason it fails to read them from the input parameters). I did some further debugging, and the line that fails now is in the get_embedding function in data_utils.py, `embeddings = openai.Embedding.create(deployment_id=deployment_id, input=text)`, after it gets a valid OpenAI API key. The error is:
Any hint? Thanks in advance!