googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
https://cloud.google.com/document-ai/docs/toolbox
Apache License 2.0

Convert Document AI Object to Preserve Layout Text? #159

Open raad-altaie opened 1 year ago

raad-altaie commented 1 year ago

Is your feature request related to a problem? Please describe.

I've been using Google Document AI for text extraction from scanned documents, and it's been working well in terms of extracting text. However, I'm facing an issue when it comes to preserving the layout of the text.

In AWS Textract, there's a tool called "pretty print" that helps maintain the layout of extracted text. Tesseract, on the other hand, allows for preserving interword spaces using the config='-c preserve_interword_spaces=1' option, which does roughly the same thing. I really wish "python-documentai-toolbox" could support such output.

Describe the solution you'd like

documentai object => preserved layout text

Describe alternatives you've considered

Extracting text using the pdftotext library seemed like a viable option, but surprisingly, "python-documentai-toolbox" doesn't offer support for PDF output, which is rather baffling.

holtskinner commented 12 months ago

Can you provide more information on what you mean by "preserving the layout of the text"?

Do you want all of the text to be printed to the screen or a TXT file in the same general locations as the source document?

An example of an input document and the output text would be useful.

This will likely be difficult to implement since the layout information extracted from Document AI uses bounding boxes with X, Y coordinates (which don't map cleanly onto TXT files).

Document AI by design doesn't fill in the Document.text field with extra spaces/tabs to signify where the text sits on the page.

It could be possible to use the Document.Page.Block field to identify blocks of text and place them generally in the same order, but again this isn't going to be very exact since Coordinates don't have a 1-1 relationship in text files.
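
A minimal sketch of that approximation, assuming each piece of text has already been reduced to a (text, x, y) tuple holding the normalized top-left corner of its bounding box (for example taken from layout.bounding_poly.normalized_vertices). The function name render_layout and the grid sizes are hypothetical and not part of the toolbox. It snaps each box to a fixed character grid, which is exactly why the result is only approximate.

```python
from typing import Dict, List, Tuple

# (text, x_left, y_top), both coordinates normalized to the range 0..1
Word = Tuple[str, float, float]


def render_layout(words: List[Word], cols: int = 80, rows: int = 40) -> str:
    """Place each text at the grid cell nearest to its bounding box corner."""
    grid: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        row = min(int(y * rows), rows - 1)
        col = min(int(x * cols), cols - 1)
        grid.setdefault(row, []).append((col, text))

    lines: List[str] = []
    for row in range(max(grid, default=-1) + 1):
        line = ""
        for col, text in sorted(grid.get(row, [])):
            # If the target column is already occupied, shift right so that
            # neighbouring texts stay separated by at least one space.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)
```

Two boxes that land on the same grid row are merged into one line, so the choice of rows and cols decides how faithful the output is.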

raad-altaie commented 12 months ago

@holtskinner thank you for your response! What I am looking for is something like the example below.

image:

input

and the output I am getting is as follows:

Someto the left
Someto the left

Some in the middle
Some in the middle

Some with some tab
Some with some tab

Some with some space between them
Some with some space between them

Sometext here
Sometext here

this much
this much

How do I get an output string with the same structure as in the image?

i.e. as follows:

                                                 Some text here
                                                 Some text here

Some to the left
Some to the left

                    Some in the middle
                    Some in the middle

        Some with some tab
        Some with some tab

Some with some space between them                       this much
Some with some space between them                       this much

think-diff commented 11 months ago

we want to do the same thing here!

ThreeHAN commented 9 months ago

At the very least, ensuring there are spaces between words in the text output from Document AI would be of great assistance. Sometimes, when words are in different entities but next to each other, the Document AI text blob shows them as twowords as opposed to two words. Having a helper function that ensures the spaces are there would reduce custom post-processing for us.
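
A rough sketch of what such a helper could look like, assuming the words are available as (start, end) offsets into the document text (the kind of offsets carried by text_anchor.text_segments). The name join_tokens is hypothetical and not part of the toolbox. It only inserts a space where two adjacent pieces touch without any whitespace between them.

```python
from typing import List, Tuple


def join_tokens(text: str, segments: List[Tuple[int, int]]) -> str:
    """Rebuild text from (start, end) segments, separating fused neighbours."""
    out: List[str] = []
    for start, end in sorted(segments):
        piece = text[start:end]
        if not piece:
            continue  # skip empty segments
        # Insert a space only when neither side already has whitespace.
        if out and not out[-1][-1].isspace() and not piece[0].isspace():
            out.append(" ")
        out.append(piece)
    return "".join(out)
```

One caveat with this approach: if punctuation is segmented separately from the word before it, the helper would also put a space in front of the punctuation.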

nonlocalStream commented 4 months ago

+1, I want the same thing. Currently I'm using the PyMuPDF CLI to achieve this: python -m fitz gettext (see https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order)

I'd like the same thing for generic document OCR. I think the underlying mechanism should be similar: basically reconstructing the layout from the bounding box information (https://github.com/pymupdf/PyMuPDF/blob/c0ae13746155e9bb5c11ab7e9a42c2e73758422e/src/__main__.py#L802)

zkalson commented 4 months ago

Hey all, I was able to get this mostly working! Here's a rough overview of the process for Python:

- For each page in a document, create a reportlab Canvas object
- Create a text layer on the Canvas object and write the text onto it, using the bounding box data
- Save the PDF and use poppler or pypdf to extract the text layer into a layout-preserved .txt file
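
The first two steps could be sketched roughly as follows. This is not the code described above, only an illustration under assumptions: each page is a list of (text, x, y) tuples with coordinates normalized to 0..1 and the origin at the top-left, the page size is US Letter in points, and the names to_pdf_point and write_text_layer are hypothetical. reportlab is a third-party package and is imported only inside the function that needs it.

```python
from typing import Iterable, List, Tuple

# (text, x, y): normalized 0..1 coordinates, origin at the top-left of the page
Item = Tuple[str, float, float]


def to_pdf_point(x: float, y: float, width: float, height: float) -> Tuple[float, float]:
    """Convert normalized top-left-origin coordinates to PDF points.

    PDF user space has its origin at the bottom-left, so the y axis is flipped.
    """
    return x * width, (1.0 - y) * height


def write_text_layer(path: str, pages: Iterable[List[Item]],
                     width: float = 612.0, height: float = 792.0,
                     font_size: float = 10.0) -> None:
    """Write one PDF page per input page, drawing each text at its position."""
    from reportlab.pdfgen import canvas  # third-party dependency

    pdf = canvas.Canvas(path, pagesize=(width, height))
    for items in pages:
        # The font is set per page because showPage() resets the graphics state.
        pdf.setFont("Helvetica", font_size)
        for text, x, y in items:
            px, py = to_pdf_point(x, y, width, height)
            pdf.drawString(px, py, text)
        pdf.showPage()
    pdf.save()
```

The saved PDF can then be passed to a layout-aware extractor for the last step, for example poppler's pdftotext with its -layout option.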

The one issue I'm still stuck on is handling documents when GCP performs preprocessing on them (see my issue here).

If someone is able to help me use the transforms field, I'm happy to invest some time tidying up my code and making a PR with the feature!

Attached is an example input and output. Input-SampleDocumentAITextLayout.pdf Output-SampleDocumentAITextLayout.txt