googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
https://cloud.google.com/document-ai/docs/toolbox
Apache License 2.0

ToolBox Client Python Library #247

Closed Veeedsss closed 7 months ago

Veeedsss commented 7 months ago

I tried using the Toolbox client library for Python, but I get the following error after execution.

Error: ValueError: Invalid Document - shardInfo.shardCount (0) does not match number of shards (11).

FYI: There are 50 documents to be processed. I am using the same code suggested by Google, linked below for reference.

Link: https://github.com/GoogleCloudPlatform/document-ai-samples/blob/main/toolbox-batch-processing/documentai-toolbox-batch-entity-extraction.ipynb

holtskinner commented 7 months ago

Ah, I think this could be an issue with the Notebook rather than the Toolbox Library.

The field mask is set to not include the shardInfo in the response, which is required for the Toolbox to function with multi-shard Documents.
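
To illustrate why a missing shardInfo shows up as a count of 0, here is a minimal sketch of this kind of consistency check. It is not the Toolbox's actual code: `check_shard_count` is a hypothetical helper, and each shard is assumed to be a plain dict loaded from an output JSON file.

```python
def check_shard_count(shards):
    """Hypothetical stand-in for a shard consistency check (not the Toolbox source)."""
    for shard in shards:
        # When the field mask omits shardInfo, the field is absent from the
        # output JSON, so the shard count falls back to 0.
        shard_count = int(shard.get("shardInfo", {}).get("shardCount", 0))
        if shard_count != len(shards):
            raise ValueError(
                f"Invalid Document - shardInfo.shardCount ({shard_count}) "
                f"does not match number of shards ({len(shards)})."
            )


# Eleven shards written with a mask that drops shardInfo (assumed shape).
masked_shards = [{"text": "..."} for _ in range(11)]

# The same eleven shards with shardInfo present.
full_shards = [
    {"text": "...", "shardInfo": {"shardIndex": str(i), "shardCount": "11"}}
    for i in range(11)
]

try:
    check_shard_count(masked_shards)
except ValueError as err:
    message = str(err)

check_shard_count(full_shards)  # passes without raising
```

With the masked shards, `message` holds the same text as the error reported above.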

The line:

field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.

Should be changed to:

field_mask = "text,entities,pages,shardInfo"

Or just removed entirely, since the fieldMask is optional.
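
If you want to keep a custom mask, a small guard like the following can make sure shardInfo is always included. This is only a sketch: `with_shard_info` is a hypothetical helper, not part of the Toolbox or the notebook, and it assumes an empty mask means "return all fields".

```python
def with_shard_info(field_mask):
    """Return the field mask with shardInfo appended if it is missing.

    Hypothetical helper for illustration. An empty mask is returned
    unchanged, on the assumption that it means no mask is applied.
    """
    fields = [f.strip() for f in field_mask.split(",") if f.strip()]
    if not fields:
        return ""
    if "shardInfo" not in fields:
        fields.append("shardInfo")
    return ",".join(fields)


# The mask from the notebook, before and after the guard.
original_mask = "text,entities,pages.pageNumber"
field_mask = with_shard_info(original_mask)
```

Here `field_mask` ends up as the original mask with `,shardInfo` appended, while a mask that already lists shardInfo is left as it is.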

I'll make an update to the notebook.