Document.from_batch_process_operation method failing due to sharding made by batch process documents #271

Hi,

I've been facing the following issue when I try to use the Document.from_batch_process_operation method. I expected that this method, given a succeeded operation (like mine), would get all the output files from GCS and load them into Document objects. Instead, it fails with the exception described below. The output was produced by a batch process operation run through the Python SDK, so I expected the SDK's output and input methods to be compatible.

Note 1: The same thing happens if I use the 'from_gcs' method, passing the root directory of that operation's output (let's call it dir) as gcs_prefix. If I use the // as gcs_prefix, the method succeeds.
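
To illustrate Note 1, here is a minimal sketch of how the output objects could be grouped by their parent "directory", so that each group can be passed to from_gcs as its own gcs_prefix. The object names and the dir/123/0 layout are hypothetical, and group_shards_by_prefix is a helper written for this example, not part of the toolbox.

```python
from collections import defaultdict
from posixpath import dirname
from typing import Dict, Iterable, List


def group_shards_by_prefix(blob_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group JSON object names by their parent "directory" (hypothetical helper)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in blob_names:
        # Only the JSON output files are document shards.
        if name.endswith(".json"):
            groups[dirname(name) + "/"].append(name)
    return dict(groups)


# Hypothetical object names under the operation's root output directory.
names = [
    "dir/123/0/doc-0.json",
    "dir/123/1/doc-0.json",
    "dir/123/1/doc-1.json",
]

groups = group_shards_by_prefix(names)
for prefix, shards in sorted(groups.items()):
    # Each prefix here would be one candidate gcs_prefix for from_gcs.
    print(prefix, len(shards))
```

With a grouping like this, the root prefix is never treated as a single document; every subdirectory is loaded separately.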

Note 2: The from_gcs method raises:

ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1942).

and the from_batch_process_operation method raises:

ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).

Why do the two methods report different numbers of shards?
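
For context, the error message suggests a check along the following lines: the shard count declared in the first file is compared with the number of files found under the prefix. This is a sketch inferred from the message only, not the toolbox's actual code, and check_shard_count is a hypothetical name. If the check works roughly like this, the two numbers could differ simply because the two calls list objects under different prefixes, but that is a guess.

```python
from typing import List


def check_shard_count(declared_counts: List[int]) -> None:
    """Raise if the first shard's declared count differs from the number of files.

    declared_counts holds the shardInfo.shardCount value read from each file
    found under the prefix (hypothetical input shape for this sketch).
    """
    if declared_counts and declared_counts[0] != len(declared_counts):
        raise ValueError(
            "Invalid Document - shardInfo.shardCount "
            f"({declared_counts[0]}) does not match number of shards "
            f"({len(declared_counts)})."
        )


# A prefix containing a single one-shard document passes the check.
check_shard_count([1])

# A prefix containing many one-shard documents fails it.
try:
    check_shard_count([1] * 1053)
except ValueError as exc:
    message = str(exc)
    print(message)
```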

Environment details

OS type and version: macOS 13.0 (22A8380)
Python version: 3.8.0
pip version: 19.2.3

Steps to reproduce

Requirements:

cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
Deprecated==1.2.14
google-api-core==2.17.1
google-auth==2.28.1
google-cloud-bigquery==3.17.2
google-cloud-core==2.4.1
google-cloud-documentai==2.24.0
google-cloud-documentai-toolbox==0.13.0a0
google-cloud-storage==2.14.0
google-cloud-vision==3.7.1
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
grpc-google-iam-v1==0.12.7
grpcio==1.62.0
grpcio-status==1.62.0
idna==3.6
immutabledict==3.0.0
intervaltree==3.1.0
Jinja2==3.1.3
lxml==4.9.4
MarkupSafe==2.1.5
numpy==1.24.4
packaging==23.2
pandas==2.0.3
pikepdf==8.13.0
pillow==10.2.0
proto-plus==1.23.0
protobuf==4.25.3
pyarrow==15.0.0
pyasn1==0.5.1
pyasn1-modules==0.3.0
python-dateutil==2.9.0
pytz==2024.1
requests==2.31.0
rsa==4.9
six==1.16.0
sortedcontainers==2.4.0
tabulate==0.9.0
tzdata==2024.1
urllib3==2.2.1
wrapt==1.16.0

Execution:
python3 main.py

Code example

main.py:


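from google.cloud import documentai
from google.cloud.documentai_toolbox import document

wrapped_document = document.Document.from_batch_process_operation(
    operation_name=operation_name,
    location=location,
)

# operation_name, location, dataset, table and project are placeholders.
# from_batch_process_operation returns a list of wrapped documents.
for doc in wrapped_document:
    doc.entities_to_bigquery(
        dataset_name=dataset, table_name=table, project_id=project
    )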

Stack trace

Traceback (most recent call last):
  File "main.py", line 4, in <module>
    wrapped_document = document.Document.from_batch_process_operation(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 620, in from_batch_process_operation
    return cls.from_batch_process_metadata(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 576, in from_batch_process_metadata
    return [
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 577, in <listcomp>
    Document.from_gcs(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 507, in from_gcs
    shards = _get_shards(gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix)
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 133, in _get_shards
    raise ValueError(
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).
