googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
https://cloud.google.com/document-ai/docs/toolbox
Apache License 2.0
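
A minimal example of the wrapping the description refers to, assuming an existing Document AI output JSON file saved locally (the file path below is a placeholder):

from google.cloud.documentai_toolbox import document

# Wrap a Document AI output JSON file stored on disk.
wrapped_document = document.Document.from_document_path(document_path="path/to/document.json")

# The wrapped object exposes the document's text, pages, and entities for further processing.
print(wrapped_document.text)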

`Document.from_batch_process_operation()` method failing due to sharding made by batch process documents #271

Closed gabrielboehme closed 6 months ago

gabrielboehme commented 6 months ago

Hi,

I've been facing the following issue when I try to use the `Document.from_batch_process_operation()` method: I expected that, given a succeeded operation (like mine), this method would get all the output files from GCS and serialize them into document objects. Instead, it fails with the exception described below. The thing is, the output of the operation was produced by a batch processing operation run with the Python SDK, so I expected both SDK methods' output and input to match.

Note 1: The same thing happens if I use the `from_gcs()` method, passing the root directory of that operation's output (let's call it `dir`) as `gcs_prefix`. If I use the // as `gcs_prefix`, the method succeeds.

Note 2: The from_gcs() method raises

ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1942).

and the from_batch_process_operation() method raises:

ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).

Why the difference in shard counts between the two methods?

Environment details

Steps to reproduce

  1. Requirements:

    cachetools==5.3.3
    certifi==2024.2.2
    charset-normalizer==3.3.2
    Deprecated==1.2.14
    google-api-core==2.17.1
    google-auth==2.28.1
    google-cloud-bigquery==3.17.2
    google-cloud-core==2.4.1
    google-cloud-documentai==2.24.0
    google-cloud-documentai-toolbox==0.13.0a0
    google-cloud-storage==2.14.0
    google-cloud-vision==3.7.1
    google-crc32c==1.5.0
    google-resumable-media==2.7.0
    googleapis-common-protos==1.62.0
    grpc-google-iam-v1==0.12.7
    grpcio==1.62.0
    grpcio-status==1.62.0
    idna==3.6
    immutabledict==3.0.0
    intervaltree==3.1.0
    Jinja2==3.1.3
    lxml==4.9.4
    MarkupSafe==2.1.5
    numpy==1.24.4
    packaging==23.2
    pandas==2.0.3
    pikepdf==8.13.0
    pillow==10.2.0
    proto-plus==1.23.0
    protobuf==4.25.3
    pyarrow==15.0.0
    pyasn1==0.5.1
    pyasn1-modules==0.3.0
    python-dateutil==2.9.0
    pytz==2024.1
    requests==2.31.0
    rsa==4.9
    six==1.16.0
    sortedcontainers==2.4.0
    tabulate==0.9.0
    tzdata==2024.1
    urllib3==2.2.1
    wrapt==1.16.0
  2. Execution:

    python3 main.py

Code example

main.py:


from google.cloud import documentai
from google.cloud.documentai_toolbox import document

wrapped_document = document.Document.from_batch_process_operation(
    operation_name=operation_name,
    location=location,
)

wrapped_document.entities_to_bigquery(
    dataset_name=dataset, table_name=table, project_id=project
)

Stack trace

Traceback (most recent call last):
  File "main.py", line 4, in <module>
    wrapped_document = document.Document.from_batch_process_operation(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 620, in from_batch_process_operation
    return cls.from_batch_process_metadata(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 576, in from_batch_process_metadata
    return [
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 577, in <listcomp>
    Document.from_gcs(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 507, in from_gcs
    shards = _get_shards(gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix)
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 133, in _get_shards
    raise ValueError(
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).
holtskinner commented 6 months ago

Thanks for the feedback @gabrielboehme. Can you clarify what is in the GCS directory that you're sending to the API? Can you share the output of gsutil ls on the directory (with any PII removed)?

It seems that there are many more JSON files in the directory than there should be, based on the `Document.shardInfo.shardCount` field. Both `Document.from_batch_process_operation()` and `Document.from_gcs()` expect the directory to contain the Document JSON files from a specific operation.
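
For example, something like this (bucket name and prefix below are placeholders for your output location) would show how many Document JSON files sit under the prefix versus what the first shard's shardInfo.shardCount declares:

from google.cloud import documentai, storage

storage_client = storage.Client()

# Count the Document JSON files under the output prefix.
blobs = [
    blob
    for blob in storage_client.list_blobs("your-bucket", prefix="your/output/prefix/")
    if blob.name.endswith(".json")
]

# Read the declared shard count from the first shard found.
first_shard = documentai.Document.from_json(
    blobs[0].download_as_bytes(), ignore_unknown_fields=True
)
print(f"JSON files found: {len(blobs)}")
print(f"shardInfo.shardCount: {first_shard.shard_info.shard_count}")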

gabrielboehme commented 6 months ago

Hey @holtskinner, thanks for the reply. Sure, let me clarify:

I used the following code to run the batch operation on my custom extractor, successfully:

gcs_output_uri = 'gs://bucket_a/output/'
gcs_input_uri = 'gs://bucket_a/input/'  # This is a folder with many subfolders and files.

gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
    gcs_uri=gcs_output_uri, field_mask=None
)

input_config = documentai.BatchDocumentsInputConfig(
    gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
)
output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

request = documentai.BatchProcessRequest(
    name=name,
    input_documents=input_config,
    document_output_config=output_config,
)

# BatchProcess returns a Long Running Operation (LRO)
operation = client.batch_process_documents(request)

And separately, I used the following code to try pushing the results to BigQuery:

from google.cloud import documentai
from google.cloud.documentai_toolbox import document

wrapped_document = document.Document.from_batch_process_operation(
    operation_name=operation_name,  # projects/project_id/locations/location/operations/operation_id
    location=location,
)

wrapped_document.entities_to_bigquery(
    dataset_name=dataset, table_name=table, project_id=project
)

Which failed with:

Traceback (most recent call last):
  File "main.py", line 4, in <module>
    wrapped_document = document.Document.from_batch_process_operation(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 620, in from_batch_process_operation
    return cls.from_batch_process_metadata(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 576, in from_batch_process_metadata
    return [
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 577, in <listcomp>
    Document.from_gcs(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 507, in from_gcs
    shards = _get_shards(gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix)
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 133, in _get_shards
    raise ValueError(
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).

And when I tried to use the from_gcs method, since the other failed:

wrapped_document = document.Document.from_gcs(
    gcs_bucket_name=gcs_bucket_name,
    gcs_prefix=gcs_prefix
)

I got the following:

ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1942).

My `gsutil ls` result (always one JSON per subfolder):

gs://bucket_a/output/<operation_id>/0/<id_like_filename>.json
gs://bucket_a/output/<operation_id>/1/<id_like_filename>.json
gs://bucket_a/output/<operation_id>/2/<id_like_filename>.json
...
holtskinner commented 6 months ago

@gabrielboehme Thanks for that extra information. I'll try to reproduce on my end. Can you also specify how you're getting operation_name? This should be the way to get the operation name:

operation = client.batch_process_documents(request)
operation_name = operation.operation.name

wrapped_document = document.Document.from_batch_process_operation(
    operation_name=operation_name,
    location=location
)
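
If the operation might still be running when you do this, one common pattern (used in the Document AI samples; the timeout value below is just an example) is to block until it completes first:

# Block until the batch operation finishes (or the timeout expires).
operation.result(timeout=3600)
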
holtskinner commented 6 months ago

Note: I got a different error when running on Python 3.8 that has been fixed in 0.13.1a0, so I ran it on that version.

I'm able to reproduce the behavior. It seems the fetching of JSON files from GCS isn't working as expected, and too many JSONs are being imported as a single Document. I'll investigate further.

holtskinner commented 6 months ago

My working theory is that batch_process_metadata.output_gcs_destination looks like this:

output_gcs_destination: "gs://bucket-name/toolbox-test-output/15721623642972401514/1"

And without the trailing slash at the end, other folders like:

gs://bucket-name/toolbox-test-output/15721623642972401514/11
gs://bucket-name/toolbox-test-output/15721623642972401514/12
etc.

are all being read in as well. It's a quirk of how GCS prefix matching works.
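
A minimal sketch of the quirk with the google-cloud-storage client (the bucket and folder names are the placeholder values from above):

from google.cloud import storage

storage_client = storage.Client()

# Without a trailing slash, the prefix ".../1" also matches ".../11/", ".../12/", etc.
broad = list(
    storage_client.list_blobs(
        "bucket-name", prefix="toolbox-test-output/15721623642972401514/1"
    )
)

# Appending "/" restricts the listing to objects under the ".../1/" folder only.
narrow = list(
    storage_client.list_blobs(
        "bucket-name", prefix="toolbox-test-output/15721623642972401514/1/"
    )
)

print(len(broad), len(narrow))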

I'll also add unit/integration tests for this edge case.

holtskinner commented 6 months ago

PR #274 should fix the error message. However, your code to upload entities to BigQuery won't work as-is: `from_batch_process_operation()` returns a list of Documents, so you'd need to do this:

wrapped_documents = document.Document.from_batch_process_operation(
    operation_name=operation_name, # projects/project_id/locations/location/operations/operation_id
    location=location
)

for wrapped_document in wrapped_documents:
    wrapped_document.entities_to_bigquery(
        dataset_name=dataset, table_name=table, project_id=project
    )
gabrielboehme commented 6 months ago

@holtskinner I see. Thanks for the help and for solving it quickly!