googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
https://cloud.google.com/document-ai/docs/toolbox
Apache License 2.0

`Document.entities` field is unusable when using data from Classifier output #332

Closed evekhm closed 2 weeks ago

evekhm commented 3 weeks ago

Hello,

The wrapped document returned by document.from_batch_process_metadata (or any of the other constructors) has an unusable entities field when the data comes from a Classifier.

With Splitter output everything works fine. With Classifier output, you won't get any of the important information, such as type and confidence.

import os

from google.cloud.documentai_toolbox import document

# Both sample files sit next to this script.
base_dir = os.path.dirname(__file__)

# Splitter output: entities print as expected.
doc = document.Document.from_document_path(os.path.join(base_dir, "output-document_split.json"))
print(doc.entities)

# Classifier output: entities are unusable here.
doc = document.Document.from_document_path(os.path.join(base_dir, "output-document_classify.json"))
print(doc.entities)

output-document_split.json output-document_classify.json

evekhm commented 3 weeks ago

I do see that the information is there inside shards.entities, but entities itself is totally broken/missing/unusable.
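Since the data is still present in the shards, one possible workaround is to read it from there directly. This is only a sketch: it assumes the wrapped document exposes a `shards` list and that each shard entity carries `type_` and `confidence` attributes, as in the Document AI Python client. The helper name `entities_from_shards` is hypothetical, and the demo uses stand-in objects rather than a real document.

```python
from types import SimpleNamespace


def entities_from_shards(wrapped_document):
    """Collect (type, confidence) pairs straight from the raw shards.

    Hypothetical helper: bypasses the wrapped `entities` field entirely.
    """
    results = []
    for shard in wrapped_document.shards:
        for entity in shard.entities:
            # Assumed attribute names: `type_` and `confidence`.
            results.append((entity.type_, entity.confidence))
    return results


# Stand-in for a wrapped document built from Classifier output.
fake_document = SimpleNamespace(
    shards=[
        SimpleNamespace(
            entities=[
                SimpleNamespace(type_="invoice", confidence=0.92),
                SimpleNamespace(type_="receipt", confidence=0.08),
            ]
        )
    ]
)

print(entities_from_shards(fake_document))
```

With a real wrapped document, the same function would be called as `entities_from_shards(doc)`, assuming the attributes above are present.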

evekhm commented 3 weeks ago

Looking further at the issue:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/dataclasses.py", line 405, in wrapper
    result = user_function(self)
  File "<string>", line 3, in __repr__
AttributeError: 'Entity' object has no attribute 'start_page'

Both start_page and end_page need to be made Optional, since this information is not provided by the Classifier.
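The failure mode and the proposed fix can be illustrated with plain dataclasses. The two classes below are hypothetical, simplified stand-ins and not the toolbox's actual `Entity` definition. They assume the page fields are only assigned when page information is available, which would be consistent with the `AttributeError` raised from `__repr__` in the traceback.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class BrokenEntity:
    """Hypothetical stand-in: page fields are declared but may never be set."""

    type_: str
    confidence: float
    page_refs: list = dataclasses.field(default_factory=list, repr=False)
    start_page: int = dataclasses.field(init=False)
    end_page: int = dataclasses.field(init=False)

    def __post_init__(self):
        # Classifier output has no page references, so nothing is assigned.
        if self.page_refs:
            self.start_page = self.page_refs[0]
            self.end_page = self.page_refs[-1]


@dataclasses.dataclass
class FixedEntity:
    """Same shape, but the page fields are Optional and default to None."""

    type_: str
    confidence: float
    page_refs: list = dataclasses.field(default_factory=list, repr=False)
    start_page: Optional[int] = dataclasses.field(init=False, default=None)
    end_page: Optional[int] = dataclasses.field(init=False, default=None)

    def __post_init__(self):
        if self.page_refs:
            self.start_page = self.page_refs[0]
            self.end_page = self.page_refs[-1]


# The generated __repr__ reads every field, so the unset ones raise.
try:
    print(BrokenEntity("invoice", 0.97))
except AttributeError as exc:
    print(f"repr failed: {exc}")

# With Optional defaults, the same input prints cleanly.
print(FixedEntity("invoice", 0.97))
print(FixedEntity("invoice", 0.97, page_refs=[0, 1]))
```

Defaulting to `None` keeps the fields present for Splitter-style data while letting Classifier-style data through, at the cost that callers need to handle `None` for the page range.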