googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
https://cloud.google.com/document-ai/docs/toolbox
Apache License 2.0

Document.from_batch_process_operation method failing due to sharding made by batch process documents #271 #273

Closed gabrielboehme closed 6 months ago

gabrielboehme commented 6 months ago

Hi,

I've been running into an issue with the Document.from_batch_process_operation method. I expected that this method, given a succeeded operation (like mine), would fetch all the output files from GCS and load them into document objects. Instead, it fails with the exception described below. The output was produced by a batch processing operation run through the Python SDK, so I expected the two SDK methods to agree on output and input.

Note 1: The same thing happens if I use the from_gcs method and pass the root directory of that operation's output (let's call it dir) as gcs_prefix. If I use the // as gcs_prefix, the method succeeds.

Note 2: The from_gcs method raises:

ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1942).

and the from_batch_process_operation method raises:

ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).

Why is the number of shards different between the two methods?
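
One reading that is consistent with both numbers, offered as an inference and not something this report confirms: suppose the batch output has one directory per input document, numbered 0 to 1941, each holding a single shard, and suppose the shard lookup lists objects by plain string prefix (which is how Cloud Storage prefix listing works). Then the root prefix matches all 1942 files, while a per-document prefix with no trailing slash, such as the one for document 1, also matches directories 10-19, 100-199 and 1000-1941. The layout and object names below are made up for illustration.

```python
# Hypothetical layout: one output directory per input document, numbered 0..1941,
# each holding a single JSON shard (consistent with shardInfo.shardCount == 1).
NUM_DOCUMENTS = 1942
object_names = [f"root/{i}/output-0.json" for i in range(NUM_DOCUMENTS)]


def count_matches(prefix: str) -> int:
    """Count object names that start with `prefix` (plain string-prefix match)."""
    return sum(name.startswith(prefix) for name in object_names)


root_count = count_matches("root/")           # every shard of every document
no_slash_count = count_matches("root/1")      # document 1 plus 10-19, 100-199, 1000-1941
with_slash_count = count_matches("root/1/")   # only document 1

print(root_count, no_slash_count, with_slash_count)  # prints: 1942 1053 1
```

Under these assumptions, 1 + 10 + 100 + 942 = 1053, which matches the count in the second error, and a prefix ending in a slash matches exactly one shard.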

Environment details

Steps to reproduce

  1. Requirements:

    cachetools==5.3.3
    certifi==2024.2.2
    charset-normalizer==3.3.2
    Deprecated==1.2.14
    google-api-core==2.17.1
    google-auth==2.28.1
    google-cloud-bigquery==3.17.2
    google-cloud-core==2.4.1
    google-cloud-documentai==2.24.0
    google-cloud-documentai-toolbox==0.13.0a0
    google-cloud-storage==2.14.0
    google-cloud-vision==3.7.1
    google-crc32c==1.5.0
    google-resumable-media==2.7.0
    googleapis-common-protos==1.62.0
    grpc-google-iam-v1==0.12.7
    grpcio==1.62.0
    grpcio-status==1.62.0
    idna==3.6
    immutabledict==3.0.0
    intervaltree==3.1.0
    Jinja2==3.1.3
    lxml==4.9.4
    MarkupSafe==2.1.5
    numpy==1.24.4
    packaging==23.2
    pandas==2.0.3
    pikepdf==8.13.0
    pillow==10.2.0
    proto-plus==1.23.0
    protobuf==4.25.3
    pyarrow==15.0.0
    pyasn1==0.5.1
    pyasn1-modules==0.3.0
    python-dateutil==2.9.0
    pytz==2024.1
    requests==2.31.0
    rsa==4.9
    six==1.16.0
    sortedcontainers==2.4.0
    tabulate==0.9.0
    tzdata==2024.1
    urllib3==2.2.1
    wrapt==1.16.0
  2. Execution:

    python3 main.py

Code example

main.py:


from google.cloud import documentai
from google.cloud.documentai_toolbox import document

wrapped_document = document.Document.from_batch_process_operation(
    operation_name=operation_name,
    location=location,
)

wrapped_document.entities_to_bigquery(
    dataset_name=dataset, table_name=table, project_id=project
)

Stack trace

Traceback (most recent call last):
  File "main.py", line 4, in <module>
    wrapped_document = document.Document.from_batch_process_operation(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 620, in from_batch_process_operation
    return cls.from_batch_process_metadata(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 576, in from_batch_process_metadata
    return [
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 577, in <listcomp>
    Document.from_gcs(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 507, in from_gcs
    shards = _get_shards(gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix)
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 133, in _get_shards
    raise ValueError(
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).
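
Following Note 1, where a more specific gcs_prefix succeeds, a possible workaround is to load each output directory on its own. This is only a sketch: `load_documents_per_directory` and its `loader` parameter are hypothetical names introduced here, the directory list has to be supplied by the caller, and the only toolbox call used is `Document.from_gcs` with the `gcs_bucket_name` and `gcs_prefix` arguments that appear in the stack trace.

```python
from typing import Callable, Iterable, List, Optional


def load_documents_per_directory(
    gcs_bucket_name: str,
    directories: Iterable[str],
    loader: Optional[Callable[..., object]] = None,
) -> List[object]:
    """Load one wrapped document per output directory (hypothetical helper)."""
    if loader is None:
        # Imported lazily so the helper can be exercised without the toolbox installed.
        from google.cloud.documentai_toolbox import document

        loader = document.Document.from_gcs

    documents = []
    for directory in directories:
        # A trailing slash keeps "root/1/" from also matching "root/10/", "root/100/", etc.
        prefix = directory if directory.endswith("/") else directory + "/"
        documents.append(loader(gcs_bucket_name=gcs_bucket_name, gcs_prefix=prefix))
    return documents
```

With the default loader, the helper returns one wrapped document per directory, so a call such as entities_to_bigquery would be applied to each element in turn.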

parthea commented 6 months ago

Duplicate of #271.