Describe the bug
The data preparation script imports only 1 chunk from large PDFs. I am trying to prepare and import a 99-page PDF document, running the data preparation script on only that one document with the Layout model, using the config and command below.
Output from script
Data preparation script started
Using Form Recognizer resource REDACTED for PDF cracking, with the Layout model.
Preparing data for index: chunkingtesting
Using existing search service REDACTED
Created search index chunkingtesting
Chunking path C:\Users\REDACTED ...
Total files to process=1 out of total directory size=1
Multiprocessing with njobs=4
0%| | 0/1 [00:00<?, ?it/s]
SingletonFormRecognizerClient: Creating instance of Form recognizer per process
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.02s/it]
Processed 1 files
Unsupported formats: 0 files
Files with errors: 0 files
Found 1 chunks
Uploading documents to index...
Indexing Chunks...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.60it/s]
Validating index...
Index is empty. Waiting 60 seconds to check again...
Index is empty. Waiting 60 seconds to check again...
Index is empty. Waiting 60 seconds to check again...
The index contains 1 chunks.
The average chunk size of the index is 9712.0 bytes.
Index validation completed
Data preparation for index chunkingtesting completed
Data preparation script completed. 1 indexes updated.
After the script completes, I looked at the index that was created and found a single chunk.
The content of that one chunk covers only the first two pages of the PDF. The remaining pages (about 98) are missing from the index.
Expected behavior
I expect to see many chunks that together contain all the information from the 99-page PDF file.
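For a rough sense of scale, here is a minimal sketch of how many chunks one might expect. It is not part of the data preparation script: the helper name `estimate_expected_chunks` is hypothetical, and it assumes that text density is uniform across pages and that every chunk is about the size of the one chunk reported in the log (9712 bytes covering roughly two pages).

```python
import math


def estimate_expected_chunks(
    total_pages: int,
    pages_in_sample: int,
    sample_bytes: float,
    chunk_bytes: float,
) -> int:
    """Rough estimate of the chunk count for a whole document.

    Hypothetical helper for illustration only. It extrapolates from a
    sample (the pages actually found in the index) and assumes uniform
    text density per page and a fixed chunk size.
    """
    if total_pages <= 0 or pages_in_sample <= 0:
        raise ValueError("page counts must be positive")
    if sample_bytes <= 0 or chunk_bytes <= 0:
        raise ValueError("byte sizes must be positive")

    bytes_per_page = sample_bytes / pages_in_sample
    estimated_total_bytes = bytes_per_page * total_pages
    return math.ceil(estimated_total_bytes / chunk_bytes)


# Values taken from the report: a 99-page PDF, one 9712-byte chunk
# that covers about the first two pages.
estimate = estimate_expected_chunks(
    total_pages=99,
    pages_in_sample=2,
    sample_bytes=9712.0,
    chunk_bytes=9712.0,
)
print(f"Estimated chunks for the full document: {estimate}")
```

Under those assumptions the estimate is on the order of 50 chunks, compared with the 1 chunk that was actually indexed. The real number depends on the chunk size and overlap configured for the script.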