microsoft / sample-app-aoai-chatGPT

Sample code for a simple web chat experience through Azure OpenAI, including Azure OpenAI On Your Data.
MIT License

Data Preparation on PDF files fails #232

Open agazzeri opened 10 months ago

agazzeri commented 10 months ago

I'm trying to use the data_preparation.py script to ingest PDF files into a search index for use with OpenAI. I want to include PDF analysis with Document Intelligence (Form Recognizer), plus chunking and vectorization, so my config.json file includes "vector_config_name": "default".
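For reference, the config.json for this script is a JSON array of index definitions. A minimal sketch with illustrative placeholder values (the authoritative schema is in the repo's scripts/README):

[
    {
        "data_path": "/path/to/data",
        "location": "<azure-region>",
        "subscription_id": "<subscription-id>",
        "resource_group": "<resource-group>",
        "search_service_name": "<search-service>",
        "index_name": "<index-name>",
        "chunk_size": 1024,
        "token_overlap": 128,
        "semantic_config_name": "default",
        "language": "en",
        "vector_config_name": "default"
    }
]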

I run the script with the following command: python3 data_preparation.py --config config.json --embedding-model-endpoint "https://xxxxx.openai.azure.com/openai/deployments/TextEmbedding/embeddings?api-version=2023-06-01-preview" --form-rec-resource "myformrecname" --form-rec-key myformreckey --njobs=4

I have 4 pdf files in the data_path folder and I'm running in a WSL with Ubuntu distro. The script fails with the following (I've replaced my details with ***):

Data preparation script started
Using Form Recognizer resource *** for PDF cracking, with the Read model.
Preparing data for index: ***
Using existing search service ***
Created search index ***
Chunking directory...
Total files to process=4 out of total directory size=4
Multiprocessing with njobs=4
SingletonFormRecognizerClient: Creating instance of Form recognizer per process
SingletonFormRecognizerClient: Skipping since credentials not provided. Assuming NO form recognizer extensions(like .pdf) in directory
SingletonFormRecognizerClient: Creating instance of Form recognizer per process
SingletonFormRecognizerClient: Creating instance of Form recognizer per process
SingletonFormRecognizerClient: Creating instance of Form recognizer per process
SingletonFormRecognizerClient: Skipping since credentials not provided. Assuming NO form recognizer extensions(like .pdf) in directory
SingletonFormRecognizerClient: Skipping since credentials not provided. Assuming NO form recognizer extensions(like .pdf) in directory
SingletonFormRecognizerClient: Skipping since credentials not provided. Assuming NO form recognizer extensions(like .pdf) in directory
  0%|                                                                                             | 0/4 [00:00<?, ?it/s]File (/mnt/c/***.pdf) failed with  'object' object has no attribute 'begin_analyze_document'
File (/mnt/c/***.pdf) failed with  'object' object has no attribute 'begin_analyze_document'
File (/mnt/c/***.pdf) failed with  'object' object has no attribute 'begin_analyze_document'
File (/mnt/c/***.pdf) failed with  'object' object has no attribute 'begin_analyze_document'
100%|█████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 91.52it/s]
Traceback (most recent call last):
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_preparation.py", line 424, in <module>
    create_index(index_config, credential, form_recognizer_client, embedding_model_endpoint=args.embedding_model_endpoint, use_layout=args.form_rec_use_layout, njobs=args.njobs)
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_preparation.py", line 371, in create_index
    raise Exception("No chunks found. Please check the data path and chunk size.")
Exception: No chunks found. Please check the data path and chunk size.
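For context on the 'object' object has no attribute 'begin_analyze_document' failures: the log lines above suggest that data_utils.py builds one Form Recognizer client per worker process from environment variables and falls back to a bare object() placeholder when they are missing. A minimal sketch of that pattern, inferred from the log messages rather than copied from the repo:

import os
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

class SingletonFormRecognizerClient:
    instance = None

    def __new__(cls):
        if not cls.instance:
            print("SingletonFormRecognizerClient: Creating instance of Form recognizer per process")
            url = os.getenv("FORM_RECOGNIZER_ENDPOINT")
            key = os.getenv("FORM_RECOGNIZER_KEY")
            if url and key:
                cls.instance = DocumentAnalysisClient(endpoint=url, credential=AzureKeyCredential(key))
            else:
                # Placeholder instance: any later call such as begin_analyze_document()
                # fails with "'object' object has no attribute 'begin_analyze_document'".
                print("SingletonFormRecognizerClient: Skipping since credentials not provided.")
                cls.instance = object()
        return cls.instance

Because each worker reads the environment rather than the CLI flags, this would explain why exporting FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY, as in the update below, gets past this step.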

Update: I was able to get one step further by creating environment variables FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY with the same values passed as inputs to the script (for some reason it fails to pick those up from the input parameters). I did some further debugging, and the line that now fails is in the get_embedding function in data_utils.py: embeddings = openai.Embedding.create(deployment_id=deployment_id, input=text), after it gets a valid OpenAI API key. The error is:

Traceback (most recent call last):
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_preparation.py", line 424, in <module>
    create_index(index_config, credential, form_recognizer_client, embedding_model_endpoint=args.embedding_model_endpoint, use_layout=args.form_rec_use_layout, njobs=args.njobs)
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_preparation.py", line 366, in create_index
    result = chunk_directory(config["data_path"], num_tokens=config["chunk_size"], token_overlap=config.get("token_overlap",0),
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_utils.py", line 758, in chunk_directory
    result, is_error = process_file(file_path=file_path,directory_path=directory_path, ignore_errors=ignore_errors,
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_utils.py", line 683, in process_file
    result = chunk_file(
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_utils.py", line 639, in chunk_file
    return chunk_content(
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_utils.py", line 584, in chunk_content
    raise e
  File "/mnt/c/src/sample-app-aoai-chatGPT/scripts/data_utils.py", line 559, in chunk_content
    raise Exception(f"Error getting embedding for chunk={chunk}")
Exception: Error getting embedding for chunk=[...]

Any hint? Thanks in advance!

sarah-widder commented 10 months ago

Hi @agazzeri, you may be getting an error from the attempt to use AAD auth to get embeddings. You can switch this back to key-based auth by modifying the get_embedding() function in data_utils.py like so:

def get_embedding(text, embedding_model_endpoint=None, embedding_model_key=None):
    endpoint = embedding_model_endpoint if embedding_model_endpoint else os.environ.get("EMBEDDING_MODEL_ENDPOINT")
    key = embedding_model_key if embedding_model_key else os.environ.get("EMBEDDING_MODEL_KEY")

    if endpoint is None or key is None:
        raise Exception("EMBEDDING_MODEL_ENDPOINT and EMBEDDING_MODEL_KEY are required for embedding")

    try:
        endpoint_parts = endpoint.split("/openai/deployments/")
        base_url = endpoint_parts[0]
        deployment_id = endpoint_parts[1].split("/embeddings")[0]

        openai.api_version = '2023-05-15'
        openai.api_base = base_url
        openai.api_type = 'azure'
        openai.api_key = key

        embeddings = openai.Embedding.create(deployment_id=deployment_id, input=text)
        return embeddings['data'][0]["embedding"]

    except Exception as e:
        raise Exception(f"Error getting embeddings with endpoint={endpoint} with error={e}")

You may need to set the EMBEDDING_MODEL_ENDPOINT and EMBEDDING_MODEL_KEY explicitly similarly to the Form Recognizer settings.
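For example, assuming the endpoint convention used in the snippet above:

export EMBEDDING_MODEL_ENDPOINT="https://<resource>.openai.azure.com/openai/deployments/<deployment>/embeddings?api-version=2023-05-15"
export EMBEDDING_MODEL_KEY="<embedding-model-key>"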

edercarlima commented 9 months ago

@agazzeri Did this solution work? I'm having the same problem: I can't create a vectorized index using PDF files.

agazzeri commented 9 months ago

@edercarlima No, it didn't work, but honestly I didn't have much time to troubleshoot further.

edercarlima commented 9 months ago

@sarah-widder

As mentioned by @agazzeri, the solution does not solve the problem. I'm having the same issue and can't create a vectorized index from a .pdf file. Are there any fixes expected to be made available?

Note: I was able to create a vectorized index using a txt file

mikel-brostrom commented 9 months ago

Same problem here with PDFs. Tried several different ones with the same result:

(env) ➜  sample-app-aoai-chatGPT git:(main) ✗ python scripts/data_preparation.py --config scripts/config.json
Data preparation script started
Preparing data for index: test1000
Using existing search service search-test-1000
Updated existing search index test1000
Chunking path /home/mikel.brostrom/sample-app-aoai-chatGPT/data...
Total files to process=1 out of total directory size=1
Multiprocessing with njobs=4
  0%|                                                                                                                       | 0/1 [00:00<?, ?it/s]SingletonFormRecognizerClient: Creating instance of Form recognizer per process
SingletonFormRecognizerClient: Skipping since credentials not provided. Assuming NO form recognizer extensions(like .pdf) in directory
'object' object has no attribute 'begin_analyze_document'
File (/home/mikel.brostrom/sample-app-aoai-chatGPT/data/2206.14651.pdf) failed with  'object' object has no attribute 'begin_analyze_document'
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 394.35it/s]
Traceback (most recent call last):
  File "/home/mikel.brostrom/sample-app-aoai-chatGPT/scripts/data_preparation.py", line 449, in <module>
    create_index(index_config, credential, form_recognizer_client, embedding_model_endpoint=args.embedding_model_endpoint, use_layout=args.form_rec_use_layout, njobs=args.njobs)
  File "/home/mikel.brostrom/sample-app-aoai-chatGPT/scripts/data_preparation.py", line 396, in create_index
    raise Exception("No chunks found. Please check the data path and chunk size.")
Exception: No chunks found. Please check the data path and chunk size

Any ideas @sarah-widder?

> Note: I was able to create a vectorized index using a txt file

txt worked for me as well

thenewnano commented 8 months ago

I'm having the same issue; it's pretty unclear what's happening. I'll have to dig in for a few hours and see what's going on.

thenewnano commented 8 months ago

Found the issue; there were two problems on my side. First, the Bicep changes I made to automate the config file creation put the ada model name instead of the deployment name into the generated endpoint, which resulted in a wrong endpoint, yet this code happily ignores that and sleeps without printing anything. The second issue I see is hitting the quota limit, which is probably the only failure the original author expected to happen!

So I'll create a merge request later adding something like:

            if add_embeddings:
                for _ in range(RETRY_COUNT):
                    try:
                        # doc.contentVector = get_embedding(chunk, azure_credential=azure_credential, embedding_model_endpoint=embedding_endpoint)
                        doc.contentVector = get_embedding_by_key(chunk, embedding_model_endpoint=embedding_endpoint)
                        break
                    except Exception as e:
                        print(f"\nError getting embedding for chunk={chunk} with error={e}")
                        # TODO: Detect which errors are retryable
                        time.sleep(30)
                if doc.contentVector is None:
                    raise Exception(f"Error getting embedding for chunk={chunk}")
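The get_embedding_by_key helper referenced above is not defined in the comment; a minimal sketch of what it could look like, assuming key-based auth with the legacy openai<1.0 SDK used elsewhere in this thread (the name and signature are taken from the comment, not from the repo):

import os
import openai

def get_embedding_by_key(text, embedding_model_endpoint=None):
    # Resolve endpoint and key from the argument or the environment (assumed convention).
    endpoint = embedding_model_endpoint or os.environ.get("EMBEDDING_MODEL_ENDPOINT")
    key = os.environ.get("EMBEDDING_MODEL_KEY")
    if endpoint is None or key is None:
        raise Exception("EMBEDDING_MODEL_ENDPOINT and EMBEDDING_MODEL_KEY are required")

    # Split ".../openai/deployments/<deployment>/embeddings?..." into base URL and deployment.
    base_url, rest = endpoint.split("/openai/deployments/")
    deployment_id = rest.split("/embeddings")[0]

    # Always use key-based auth, never AAD tokens.
    openai.api_type = "azure"
    openai.api_base = base_url
    openai.api_version = "2023-05-15"
    openai.api_key = key

    embeddings = openai.Embedding.create(deployment_id=deployment_id, input=text)
    return embeddings["data"][0]["embedding"]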

mikel-brostrom commented 8 months ago

So this error:

Exception: No chunks found. Please check the data path and chunk size

could be due to no quota available?

ldwv6 commented 7 months ago

As @sarah-widder said, it was an issue related to AAD authentication. The get_embedding function can be modified and used as follows, with the AAD branch commented out so that key-based auth is always used:

def get_embedding(text, embedding_model_endpoint=None, embedding_model_key=None, azure_credential=None):
    endpoint = embedding_model_endpoint if embedding_model_endpoint else os.environ.get("EMBEDDING_MODEL_ENDPOINT")
    key = embedding_model_key if embedding_model_key else os.environ.get("EMBEDDING_MODEL_KEY")

    if azure_credential is None and (endpoint is None or key is None):
        raise Exception("EMBEDDING_MODEL_ENDPOINT and EMBEDDING_MODEL_KEY are required for embedding")

    try:
        endpoint_parts = endpoint.split("/openai/deployments/")
        base_url = endpoint_parts[0]
        deployment_id = endpoint_parts[1].split("/embeddings")[0]

        openai.api_version = '2023-05-15'
        openai.api_base = base_url

        # if azure_credential is not None:
        #     openai.api_key = azure_credential.get_token("https://cognitiveservices.azure.com/.default").token
        #     openai.api_type = "azure_ad"
        # else:
        openai.api_type = 'azure'
        openai.api_key = "<your key>"  # insert your key

        embeddings = openai.Embedding.create(deployment_id=deployment_id, input=text)
        return embeddings['data'][0]["embedding"]

    except Exception as e:
        raise Exception(f"Error getting embeddings with endpoint={endpoint} with error={e}")

Roydon commented 7 months ago

I am still experiencing this error even after all of the changes suggested above. I also tried hard-coding the endpoint & key in data_utils.py.

  0%|                                                                                                                                                                                           | 0/1 [02:30<?, ?it/s]
Traceback (most recent call last):
  File "/Project/sample-app-aoai-chatGPT/./scripts/prepdocs.py", line 233, in <module>
    create_and_populate_index(
  File "/Project/sample-app-aoai-chatGPT/./scripts/prepdocs.py", line 137, in create_and_populate_index
    result = chunk_directory(
             ^^^^^^^^^^^^^^^^
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 980, in chunk_directory
    result, is_error = process_file(file_path=file_path,directory_path=directory_path, ignore_errors=ignore_errors,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 865, in process_file
    result = chunk_file(
             ^^^^^^^^^^^
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 821, in chunk_file
    return chunk_content(
           ^^^^^^^^^^^^^^
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 767, in chunk_content
    raise e
  File "/Project/sample-app-aoai-chatGPT/scripts/data_utils.py", line 742, in chunk_content
    raise Exception(f"Error getting embedding for chunk={chunk}")
jimma72 commented 4 months ago

I had the same issue, and as @sarah-widder says it is an issue with authentication via the Azure CLI. In my case I had not added the "Cognitive Services OpenAI User" role on the Azure OpenAI resource I was trying to use; assigning that role to the current user fixed it.
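For anyone in the same situation, the role can be assigned with the Azure CLI; a sketch with placeholder names:

az role assignment create \
    --assignee "<user-principal-name-or-object-id>" \
    --role "Cognitive Services OpenAI User" \
    --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<aoai-resource>"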

fitnesswanderer commented 3 months ago

So I have the Contributor role on my Azure OpenAI resource. Will it work with the Azure CLI, or do I need to upgrade that to Owner or User Access Administrator? I am getting the same error as shown above.

fitnesswanderer commented 3 months ago

> I had the same issue, and as @sarah-widder says it is an issue with authentication via the Azure CLI. In my case I had not added the "Cognitive Services OpenAI User" role on the Azure OpenAI resource I was trying to use; assigning that role to the current user fixed it.

This is for RBAC roles; I have Contributor access on the Azure OpenAI resource.

arnabbiswas1 commented 2 months ago

Here is how it worked for me:

  1. Set the Azure OpenAI key as an environment variable: export EMBEDDING_MODEL_KEY=YOUR_AOAI_ENVKEY

  2. Change the code for get_embedding() in data_utils.py to use the api_key instead of a token (don't use azure_ad_token when creating the AzureOpenAI client):

        # if azure_credential is not None:
        #     api_key = azure_credential.get_token("https://cognitiveservices.azure.com/.default").token
        # else:
        api_key = key

        client = AzureOpenAI(api_version=api_version, azure_endpoint=base_url, api_key=api_key)
        embeddings = client.embeddings.create(model=deployment_id, input=text)
        return embeddings.dict()['data'][0]['embedding']

  3. Pass the embedding model endpoint as a command-line argument. As @sarah-widder mentioned it could be set as an env variable (EMBEDDING_MODEL_ENDPOINT) as well:

    python data_preparation.py --config config.json --njobs=1 --embedding-model-endpoint "https://<oai_env_name>.openai.azure.com/openai/deployments/text-embedding-ada-002/embeddings?api-version=2024-02-01"

    Here is the diff for the code change: (screenshot of the diff not reproduced here)
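Putting those pieces together, a minimal sketch of a purely key-based get_embedding() for the openai>=1.0 SDK (the api-version parsing is an assumption; adjust it to match your endpoint):

import os
from openai import AzureOpenAI

def get_embedding(text, embedding_model_endpoint=None, embedding_model_key=None):
    endpoint = embedding_model_endpoint or os.environ.get("EMBEDDING_MODEL_ENDPOINT")
    key = embedding_model_key or os.environ.get("EMBEDDING_MODEL_KEY")
    if endpoint is None or key is None:
        raise Exception("EMBEDDING_MODEL_ENDPOINT and EMBEDDING_MODEL_KEY are required for embedding")

    # Split ".../openai/deployments/<deployment>/embeddings?api-version=..." apart.
    base_url, rest = endpoint.split("/openai/deployments/")
    deployment_id = rest.split("/embeddings")[0]
    api_version = rest.split("api-version=")[1].split("&")[0] if "api-version=" in rest else "2024-02-01"

    client = AzureOpenAI(api_version=api_version, azure_endpoint=base_url, api_key=key)
    embeddings = client.embeddings.create(model=deployment_id, input=text)
    return embeddings.data[0].embedding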