Thanks for the feedback @gabrielboehme. Can you clarify what is in the GCS directory that you're sending to the API? Can you share the output of gsutil ls on the directory (with any PII removed)?
It seems that there are many more JSON files in the directory than there should be, based on the Document.shardInfo.shardCount field. Both Document.from_batch_process_operation() and Document.from_gcs() expect the directory to contain the Document JSON files from a specific operation.
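(For context, this is roughly the consistency check involved. It is a paraphrased sketch, not the toolbox's exact internals, and the helper name get_shards_sketch is made up for illustration: every JSON under the prefix is parsed as a Document, and each shard's declared shard count must equal the number of files actually found.)

from google.cloud import documentai, storage

def get_shards_sketch(gcs_bucket_name: str, gcs_prefix: str):
    # List every JSON blob under the prefix and parse it as a Document.
    shards = []
    for blob in storage.Client().list_blobs(gcs_bucket_name, prefix=gcs_prefix):
        if blob.name.endswith(".json"):
            shards.append(
                documentai.Document.from_json(
                    blob.download_as_bytes(), ignore_unknown_fields=True
                )
            )
    # Each shard declares how many shards the document was split into;
    # a mismatch with the number of files found raises the error below.
    for shard in shards:
        if shard.shard_info.shard_count != len(shards):
            raise ValueError(
                f"Invalid Document - shardInfo.shardCount "
                f"({shard.shard_info.shard_count}) does not match "
                f"number of shards ({len(shards)})."
            )
    return shards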
Hey @holtskinner, thanks for the reply. Sure, let me clarify:
I used the following code to execute the batch operation on my custom extractor, successfully:
gcs_output_uri = 'gs://bucket_a/output/'
gcs_prefix = 'gs://bucket_a/input/'  # This is a folder with many subfolders and files.

gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
    gcs_uri=gcs_output_uri, field_mask=None
)
input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)
output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

request = documentai.BatchProcessRequest(
    name=name,  # Full processor resource name
    input_documents=input_config,
    document_output_config=output_config,
)

# BatchProcess returns a Long Running Operation (LRO)
operation = client.batch_process_documents(request)
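(An aside, not part of the original snippet: from_batch_process_operation needs a finished operation, so the LRO returned above can be awaited first. A minimal continuation of the snippet; the timeout value is illustrative.)

operation.result(timeout=3600)  # Block until the batch operation completes.
operation_name = operation.operation.name  # projects/.../operations/...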
And separately, I used the following code to try pushing to BigQuery:
from google.cloud import documentai
from google.cloud.documentai_toolbox import document

wrapped_document = document.Document.from_batch_process_operation(
    operation_name=operation_name,  # projects/project_id/locations/location/operations/operation_id
    location=location,
)
wrapped_document.entities_to_bigquery(
    dataset_name=dataset, table_name=table, project_id=project
)
Which failed with:
Traceback (most recent call last):
  File "main.py", line 4, in <module>
    wrapped_document = document.Document.from_batch_process_operation(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 620, in from_batch_process_operation
    return cls.from_batch_process_metadata(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 576, in from_batch_process_metadata
    return [
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 577, in <listcomp>
    Document.from_gcs(
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 507, in from_gcs
    shards = _get_shards(gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix)
  File "<script_location>/venv/lib/python3.8/site-packages/google/cloud/documentai_toolbox/wrappers/document.py", line 133, in _get_shards
    raise ValueError(
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1053).
And when I tried to use the from_gcs() method, since the other one failed:
wrapped_document = document.Document.from_gcs(
    gcs_bucket_name=gcs_bucket_name,
    gcs_prefix=gcs_prefix,
)
I got the following:
ValueError: Invalid Document - shardInfo.shardCount (1) does not match number of shards (1942).
My gsutil ls result (always one JSON per subfolder):
gs://bucket_a/output/<operation_id>/0/<id_like_filename>.json
gs://bucket_a/output/<operation_id>/1/<id_like_filename>.json
gs://bucket_a/output/<operation_id>/2/<id_like_filename>.json
...
@gabrielboehme Thanks for that extra information. I'll try to reproduce on my end. Can you also specify how you're getting operation_name? This should be the way to get the operation name:
operation = client.batch_process_documents(request)
operation_name = operation.operation.name

wrapped_document = document.Document.from_batch_process_operation(
    operation_name=operation_name,
    location=location,
)
Note: I got a different error when running on Python 3.8 that has been fixed in 0.13.1a0, so I ran it on that version.
I'm able to reproduce the behavior. It seems like the fetching of JSON files from GCS isn't working as expected, and too many JSONs are being imported as a single Document. I'll investigate further.
My working theory is that batch_process_metadata.output_gcs_destination looks like this:
output_gcs_destination: "gs://bucket-name/toolbox-test-output/15721623642972401514/1"
and without the trailing slash at the end, other folders like
gs://bucket-name/toolbox-test-output/15721623642972401514/11
gs://bucket-name/toolbox-test-output/15721623642972401514/12
etc. are all being read in as well. It's a quirk of how GCS prefixes work, illustrated by the sketch below.
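(To make the quirk concrete, here is a minimal sketch using the google-cloud-storage client; the bucket and path are the placeholders from the listing above. GCS has no real directories, so a prefix is a plain string match, and ".../1" also matches ".../11" and ".../12" unless the trailing slash is included.)

from google.cloud import storage

client = storage.Client()

# Prefix without a trailing slash: also matches .../11, .../12, etc.,
# because a GCS prefix is a string match, not a directory path.
loose_match = client.list_blobs(
    "bucket-name", prefix="toolbox-test-output/15721623642972401514/1"
)

# Prefix with a trailing slash: matches only objects under .../1/
exact_match = client.list_blobs(
    "bucket-name", prefix="toolbox-test-output/15721623642972401514/1/"
)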
I'll also add unit/integration tests for this edge case.
PR #274 should fix the error message. Although, your code to upload entities to BigQuery won't work as-is: from_batch_process_operation() returns a List of Documents, so you'd need to do this:
wrapped_documents = document.Document.from_batch_process_operation(
    operation_name=operation_name,  # projects/project_id/locations/location/operations/operation_id
    location=location,
)

for wrapped_document in wrapped_documents:
    wrapped_document.entities_to_bigquery(
        dataset_name=dataset, table_name=table, project_id=project
    )
@holtskinner I see. Thanks for the help and for solving it quickly!
Hi,
I've been facing the following issue when I try to use the Document.from_batch_process_operation() method. I expected that this method, given a succeeded operation (like mine), would fetch all the output files from GCS and serialize them into document objects. Instead, it fails with the exception described below. The thing is, the output was produced by a batch processing operation run with the Python SDK, so I expected both methods of the Python SDK to match output and input.
Note 1: The same thing happens if I use the from_gcs() method, passing the root directory of that operation's output as gcs_prefix. If I use the path to a single subfolder of the output (gs://bucket_a/output/<operation_id>/<n>/) as gcs_prefix, the method succeeds.
Note 2: The from_gcs() method raises a shard count of 1942 and the from_batch_process_operation() method raises a shard count of 1053 (see the stack traces below). Why the difference in shards between the methods?
Environment details
Steps to reproduce
Requirements:
Execution:
python3 main.py
Code example
main.py:
Stack trace