googleapis / google-cloud-python

Google Cloud Client Library for Python
https://googleapis.github.io/google-cloud-python/
Apache License 2.0

Duplicated tables when pdf page is rotated by 90 degrees #11073

Open fracombs opened 2 years ago

fracombs commented 2 years ago

Using a Form Parser processor to extract tables from a PDF page that is rotated by 90 degrees produces duplicated tables. Printing the bounding boxes shows that the tables are correctly detected and separated, but printing the text content shows the same text for different tables. Manually rotating the same file and processing it again produces the correct table content. Switching from the documentai_v1 client to documentai_v1beta3 doesn't change anything.

Steps to reproduce

Sample PDF file: https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf

  1. Send a request to Document AI (https://cloud.google.com/document-ai/docs/send-request)
  2. Process the output and extract the tables (https://cloud.google.com/document-ai/docs/handle-response#tables)
  3. Rotate the PDF manually and repeat steps 1-2
  4. Compare the output tables: for the second page, the second run outputs two duplicated tables
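Step 2 can be sketched against the JSON form of the processed document. This is a minimal sketch, not the code used for this report. It assumes the usual Document AI JSON field names (`text`, `pages`, `tables`, `headerRows`, `bodyRows`, `cells`, `layout.textAnchor.textSegments`) and that offsets arrive as strings, with a zero `startIndex` omitted. The helper names `anchor_text`, `table_rows` and `tables_by_page` are made up for illustration.

```python
def anchor_text(layout, full_text):
    """Resolve a layout's textAnchor against the document-level text."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    parts = []
    for seg in segments:
        # Assumption: JSON carries offsets as strings and omits a zero startIndex.
        start = int(seg.get("startIndex", 0))
        end = int(seg.get("endIndex", 0))
        parts.append(full_text[start:end])
    return "".join(parts).strip()


def table_rows(table, full_text):
    """Return one list of cell strings per row, header rows first."""
    rows = table.get("headerRows", []) + table.get("bodyRows", [])
    return [
        [anchor_text(cell.get("layout", {}), full_text) for cell in row.get("cells", [])]
        for row in rows
    ]


def tables_by_page(document):
    """Return, for each page, the list of tables as lists of rows."""
    full_text = document.get("text", "")
    return [
        [table_rows(table, full_text) for table in page.get("tables", [])]
        for page in document.get("pages", [])
    ]
```

With a saved REST response, something like `tables_by_page(json.load(f)["document"])` should give the per-page tables, assuming the response wraps the document in a top-level `document` key. Running it on both the rotated and the straightened file makes the comparison in step 4 a plain list comparison.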

Image 1: detected bounding boxes (tables in red, paragraphs in blue). The two tables are correctly detected. (image: sample_tables_rotatedpage1)

Image 2: text extracted from the PDF rotated by 90 degrees. Document AI has detected 2 tables, but the content is duplicated. (image: "sbagliato", i.e. "wrong")

Image 3: text extracted from the straight PDF. Document AI has detected 2 tables and the content of the second one is not duplicated. (image: "giusto", i.e. "right")

Edit: I also ran the test using the REST API from a Google VM (following these steps: https://www.cloudskillsboost.google/focuses/21028?) and the result is the same: I still get duplicated tables when the page is rotated 90 degrees. You can download the JSON output from the API here: https://drive.google.com/file/d/1jSAr9r8CjxBkw5M97VzogBoWRWRv7gpy/view?usp=sharing
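One way to check a JSON output like this for the duplication is to compare the text offsets that each table's cells point at. The sketch below is illustrative only: it assumes the usual Document AI JSON field names (`pages`, `tables`, `headerRows`, `bodyRows`, `cells`, `layout.textAnchor.textSegments`), and the helpers `table_segments` and `duplicated_tables` are hypothetical names, not part of any library. It flags only tables that share exactly identical segments, so it says nothing about the root cause.

```python
from itertools import combinations


def table_segments(table):
    """Collect the (start, end) text offsets referenced by every cell of a table."""
    spans = set()
    rows = table.get("headerRows", []) + table.get("bodyRows", [])
    for row in rows:
        for cell in row.get("cells", []):
            anchor = cell.get("layout", {}).get("textAnchor", {})
            for seg in anchor.get("textSegments", []):
                # Assumption: offsets are strings in JSON, zero startIndex omitted.
                spans.add((int(seg.get("startIndex", 0)), int(seg.get("endIndex", 0))))
    return spans


def duplicated_tables(document):
    """Return (page_index, table_a, table_b, shared_count) for tables on the
    same page whose cells reference at least one identical text segment."""
    hits = []
    for page_index, page in enumerate(document.get("pages", [])):
        segs = [table_segments(table) for table in page.get("tables", [])]
        for a, b in combinations(range(len(segs)), 2):
            shared = segs[a] & segs[b]
            if shared:
                hits.append((page_index, a, b, len(shared)))
    return hits
```

If the rotated page yields a non-empty result while the straightened page yields an empty list, that would suggest the duplication is already present in the text anchors returned by the service rather than introduced by client-side parsing.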

fracombs commented 2 years ago

Any updates on this? It's a pretty important bug that's keeping me from delivering a project to a client, and it's been almost a month.

parthea commented 1 year ago

I'm going to transfer this issue to the google-cloud-python repository as we are preparing to move the code for google-cloud-documentai to that repository in the next 1-2 weeks.