googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
https://cloud.google.com/document-ai/docs/toolbox
Apache License 2.0

`split_pdf` splits too much, since it does not take into account that different entities might have the same type (but different confidence) #336

Open evekhm opened 3 weeks ago

evekhm commented 3 weeks ago

Here is an example of the entities returned from the splitter:

[text_anchor {
  text_segments {
    end_index: 1424
  }
}
type_: "form1"
confidence: 0.96
page_anchor {
  page_refs {
  }
  page_refs {
    page: 1
  }
  page_refs {
    page: 2
  }
}
, text_anchor {
  text_segments {
    start_index: 1424
    end_index: 6935
  }
}
type_: "form1"
confidence: 0.68
page_anchor {
  page_refs {
    page: 3
  }
  page_refs {
    page: 4
  }
}
]

In this case we see that all pages are actually of the same type and we should not split. However, document.Document.split_pdf does not detect that.

holtskinner commented 2 weeks ago

Ok, this is a bit complicated because the Document AI Custom Splitter specifically detected those two "form1" entries as separate documents.

If we combine them together by default, it could create ambiguity when there are multiple separate documents of the same type in a file.

We could create a parameter like combine_like_document_types or something like that, but I think this issue would be best resolved on the Custom Splitter itself.
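
For illustration, here is a minimal sketch of what such combining logic might look like. It does not use the toolbox API: `SplitEntity` and `combine_like_entities` are hypothetical names, the entities are reduced to a type, a confidence, and a list of page indices, and keeping the lowest confidence for a merged group is an assumption, not something the library defines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitEntity:
    """Hypothetical stand-in for a splitter entity (not a toolbox class)."""
    type_: str
    confidence: float
    pages: List[int]


def combine_like_entities(entities: List[SplitEntity]) -> List[SplitEntity]:
    """Merge runs of consecutive entities that share the same type."""
    merged: List[SplitEntity] = []
    for entity in entities:
        if merged and merged[-1].type_ == entity.type_:
            last = merged[-1]
            merged[-1] = SplitEntity(
                type_=last.type_,
                # Assumption: report the weakest confidence of the merged group.
                confidence=min(last.confidence, entity.confidence),
                pages=last.pages + entity.pages,
            )
        else:
            # Copy the page list so the input entities are never mutated.
            merged.append(
                SplitEntity(entity.type_, entity.confidence, list(entity.pages))
            )
    return merged


# The two "form1" entities from the report above (an empty page_ref is page 0).
entities = [
    SplitEntity("form1", 0.96, [0, 1, 2]),
    SplitEntity("form1", 0.68, [3, 4]),
]
combined = combine_like_entities(entities)
print(len(combined), combined[0].pages)
```

Note that this merges any two adjacent entities of the same type, so two genuinely separate "form1" documents placed back to back would also be joined. That is the ambiguity mentioned above, and a reason such behavior would need to be opt-in.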