Status: Open. fracombs opened this issue 2 years ago.
Any updates on this? It's a pretty important bug that's keeping me from delivering a project to a client, and it's been almost a month.
I'm going to transfer this issue to the google-cloud-python
repository as we are preparing to move the code for google-cloud-documentai
to that repository in the next 1-2 weeks.
Using a Form Parser processor to extract tables from a PDF page that is rotated by 90 degrees produces duplicated tables. Printing the bounding boxes shows that the tables are correctly detected and separated, but printing the text content shows the same text for different tables. Manually rotating the same file and processing it again produces the correct table content. Switching from the documentai_v1 client to documentai_v1beta3 doesn't change anything.
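For anyone trying to reproduce this, here is a minimal sketch of how the per-table text can be read back with the Python client. It is not the exact script used for the screenshots in this issue. The helper names (`layout_text`, `table_texts`, `process_file`) and the project, location, and processor ID arguments are placeholders, and the client library is imported lazily, so the text-extraction helpers can be run on their own.

```python
from typing import List


def layout_text(layout, full_text: str) -> str:
    """Join the slices of document.text referenced by a layout's text anchor."""
    parts = []
    for segment in layout.text_anchor.text_segments:
        parts.append(full_text[int(segment.start_index):int(segment.end_index)])
    return "".join(parts)


def table_texts(document) -> List[str]:
    """Return one string per detected table (rows on lines, cells joined by ' | ')."""
    result = []
    for page in document.pages:
        for table in page.tables:
            rows = list(table.header_rows) + list(table.body_rows)
            lines = [
                " | ".join(
                    layout_text(cell.layout, document.text).strip()
                    for cell in row.cells
                )
                for row in rows
            ]
            result.append("\n".join(lines))
    return result


def process_file(path: str, project_id: str, location: str, processor_id: str):
    """Send a local PDF to a processor and return the resulting Document.

    All four arguments are placeholders; fill in your own values.
    """
    # Imported here so the helpers above work without the library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    name = client.processor_path(project_id, location, processor_id)
    with open(path, "rb") as f:
        raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")
    response = client.process_document(
        request=documentai.ProcessRequest(name=name, raw_document=raw)
    )
    return response.document
```

With the rotated page, `table_texts(process_file(...))` would be expected to return identical strings for tables whose bounding boxes differ, which is the behavior described above.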
google-cloud-documentai version: 1.5.0
Steps to reproduce
PDF sample file: https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf
Image 1: bounding boxes detected (tables in red, paragraphs in blue). The two tables are correctly detected.
Image 2: text extracted from the PDF rotated by 90 degrees. Document AI has detected 2 tables, but the content is duplicated.
Image 3: text extracted from the unrotated PDF. Document AI has detected 2 tables, and the content of the second one is not duplicated.
Edit: I also ran the test using the REST API from a Google VM (following these steps: https://www.cloudskillsboost.google/focuses/21028?) and the result is the same. I still get duplicated tables when the page is rotated 90 degrees. You can download the JSON output from the API here: https://drive.google.com/file/d/1jSAr9r8CjxBkw5M97VzogBoWRWRv7gpy/view?usp=sharing