Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
I've been facing the following issue when I try to use the Document.from_batch_process_operation method:
I expected that, given a succeeded operation (like mine), this method would fetch all the output files from GCS and deserialize them into document objects. Instead, it fails with the exception described below. What's more, the output in question was produced by a batch processing operation run through this same Python SDK, so I expected the SDK's output and input formats to match.
Note 1:
The same thing happens if I use the from_gcs method, passing the root directory of that operation's output (call it dir) as gcs_prefix. If I use // as gcs_prefix, the method succeeds.
Note 2:
The from_gcs method raises
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1942).
and the from_batch_process_operation method raises:
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).
Why do the two methods report different shard counts?
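My current hypothesis (an assumption on my part, sketched with plain Python rather than real GCS calls, and with made-up blob names): the batch operation writes one numbered subdirectory per input document, so counting every JSON under the root prefix lumps all documents' shards together, while grouping by subdirectory gives per-document counts that could match a shardInfo.shardCount of 1:

```python
from collections import defaultdict

# Hypothetical blob names, laid out the way I believe the batch
# operation writes them: one numbered subdirectory per input document,
# one JSON file per shard.
blob_names = [
    "output/1234567890/0/doc_a-0.json",
    "output/1234567890/1/doc_b-0.json",
    "output/1234567890/2/doc_c-0.json",
]

def shards_per_document(names):
    """Group shard files by their per-document subdirectory."""
    groups = defaultdict(list)
    for name in names:
        prefix = name.rsplit("/", 1)[0]  # strip the file name
        groups[prefix].append(name)
    return groups

groups = shards_per_document(blob_names)
# Counting from the root prefix sees every document's shards at once...
print(len(blob_names))  # → 3
# ...while each per-document directory holds exactly one shard.
print({d: len(f) for d, f in sorted(groups.items())})
```

If that layout assumption is right, it would explain why the check passes only when the prefix points at a single document's subdirectory.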
from google.cloud import documentai
from google.cloud.documentai_toolbox import document

wrapped_document = document.Document.from_batch_process_operation(
    operation_name=operation_name,
    location=location,
)
wrapped_document.entities_to_bigquery(
    dataset_name=dataset, table_name=table, project_id=project
)
Stack trace
Traceback (most recent call last):
File "main.py", line 4, in <module>
wrapped_document = document.Document.from_batch_process_operation(
File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 620, in from_batch_process_operation
return cls.from_batch_process_metadata(
File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 576, in from_batch_process_metadata
return [
File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 577, in <listcomp>
Document.from_gcs(
File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 507, in from_gcs
shards = _get_shards(gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix)
File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 133, in _get_shards
raise ValueError(
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).
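For reference, the failing comparison, reduced to a minimal standalone sketch (function and parameter names are mine, not the library's exact code): the shardCount recorded inside the first shard's JSON must equal the number of JSON files found under the gcs_prefix.

```python
def check_shards(shard_count_from_first_shard: int, num_files_found: int) -> None:
    # Mirrors the ValueError in the trace: shardInfo.shardCount from the
    # document JSON vs. the count of shard files under the prefix.
    if shard_count_from_first_shard != num_files_found:
        raise ValueError(
            f"Invalid Document - shardInfo.shardCount "
            f"({shard_count_from_first_shard}) does not match number of "
            f"shards ({num_files_found})."
        )

# Passing the operation root finds every document's shards at once:
try:
    check_shards(1, 1053)
except ValueError as exc:
    print(exc)

# Pointing at a single document's subdirectory would satisfy the check:
check_shards(1, 1)
print("ok")
```

So the two different counts (1942 vs. 1053) would simply be however many JSON files sit under each prefix the two methods end up scanning.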
Steps to reproduce
Run the code example above (main.py) with:
python3 main.py