microsoft / sample-app-aoai-chatGPT

Sample code for a simple web chat experience through Azure OpenAI, including Azure OpenAI On Your Data.
MIT License

Data preparation script only imports 1 chunk of large PDFs #774

Closed: andrewwiebe closed this issue 5 months ago

andrewwiebe commented 5 months ago

Describe the bug
The data preparation script is only importing 1 chunk of large PDFs. I am attempting to prepare and import a 99-page PDF document, running the data preparation script on only that one document with the layout model, using the config and command below.

[
    {
        "data_path": "C:\\Users\\PATHTODATA",
        "location": "eastus",
        "subscription_id": "SUBID",
        "resource_group": "RG NAME",
        "search_service_name": "SEARCH NAME",
        "index_name": "chunkingtesting",
        "chunk_size": 1024,
        "token_overlap": 128,
        "semantic_config_name": "default",
        "language": "en"
    }
]
python data_preparation.py --config config.json --njobs=4 --form-rec-resource FORMRECOGNAME --form-rec-key KEYGOESHERE --form-rec-use-layout
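
For context on what chunk_size and token_overlap control: the prep script splits the cracked document text into windows of roughly chunk_size tokens, with token_overlap tokens shared between consecutive windows, so a 99-page PDF should produce many chunks. Below is a minimal sketch of that idea using tiktoken as a stand-in tokenizer; it is not the repo's actual chunking code.

# Illustrative only: overlapping token-window chunking, assuming tiktoken as the tokenizer.
# data_preparation.py has its own chunking logic; this just shows the concept.
import tiktoken

def chunk_text(text: str, chunk_size: int = 1024, token_overlap: int = 128) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - token_overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
    return chunks

# With these settings, a 99-page document should yield dozens of chunks,
# which is why ending up with a single ~9.7 KB chunk is suspicious.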

Output from script

Data preparation script started
Using Form Recognizer resource REDACTED for PDF cracking, with the Layout model.
Preparing data for index: chunkingtesting
Using existing search service REDACTED 
Created search index chunkingtesting
Chunking path C:\Users\REDACTED ...
Total files to process=1 out of total directory size=1
Multiprocessing with njobs=4
  0%|                                                                                                                                                                                             | 0/1 [00:00<?, ?it/s]SingletonFormRecognizerClient: Creating instance of Form recognizer per process
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.02s/it]
Processed 1 files
Unsupported formats: 0 files
Files with errors: 0 files
Found 1 chunks
Uploading documents to index...
Indexing Chunks...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.60it/s]
Validating index...
Index is empty. Waiting 60 seconds to check again...
Index is empty. Waiting 60 seconds to check again...
Index is empty. Waiting 60 seconds to check again...
The index contains 1 chunks.
The average chunk size of the index is 9712.0 bytes.
Index validation completed
Data preparation for index chunkingtesting completed
Data preparation script completed. 1 indexes updated.

After this completes, looking at the index that was created, I see the following:

[screenshot: the index contains only a single document/chunk]

Looking at the content of that one chunk, it contains only the first two pages of the PDF. The data from the remaining pages is missing.
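
A quick way to inspect what actually landed in the index, independent of the script's validation step, is to query it directly with the azure-search-documents SDK. A hedged sketch; the "content" field name is assumed to match the schema the prep script creates, so verify it against your index definition:

# Sketch: count documents and peek at chunk content in the search index.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<SEARCH NAME>.search.windows.net",
    index_name="chunkingtesting",
    credential=AzureKeyCredential("<admin key>"),
)

print("documents in index:", client.get_document_count())
for doc in client.search(search_text="*", select=["content"], top=5):
    print(doc["content"][:200], "...")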

Expected behavior
I expect to see many chunks containing all the information from the 99-page PDF file.

andrewwiebe commented 5 months ago

This issue was because I was on the free SKU of the Document Intelligence resource. Once I changed it to S0, it is working properly now.
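
Note for anyone hitting the same symptom: the free (F0) tier of Document Intelligence / Form Recognizer analyzes only the first couple of pages of a document, which would explain why the single chunk held only pages 1 and 2. A hedged sketch of how to confirm how many pages the Layout model actually returned for your PDF, using the azure-ai-formrecognizer SDK (resource name, key, and file path are placeholders):

# Sketch: check how many pages the Layout model actually analyzed for a PDF.
# On the free F0 SKU this may be capped at the first pages; on S0 it should cover all 99.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

client = DocumentAnalysisClient(
    endpoint="https://<FORMRECOGNAME>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<KEYGOESHERE>"),
)

with open("large.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()
print("pages analyzed:", len(result.pages))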