microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License

No CSV or HTML results generated #141

Open Kehindeajayi01 opened 9 months ago

Kehindeajayi01 commented 9 months ago

Hi, thanks for the great work! I used the inference.py file on some sample table images with the goal of obtaining the extracted cells in either CSV or HTML format, but neither was generated. What am I doing wrong?

[Screenshot 2023-09-12 at 1:08:58 PM]
linkstatic12 commented 9 months ago

I might be wrong here but the table transformer doesn't do OCR.

Kehindeajayi01 commented 9 months ago

> I might be wrong here but the table transformer doesn't do OCR.

Have you checked the inference.py file provided? Specifically, the cells_to_csv and cells_to_html functions.

linkstatic12 commented 9 months ago

Yes, neither of those functions performs OCR.

bqcao commented 9 months ago

A related question: how do I get the bounding boxes for all the detected cells? They are in the return value of self.det_model(), but in my examples outputs['pred_boxes'].shape is always [1, 15, 4]. Why always 15?
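For reference, DETR-style detectors like TATR always emit a fixed number of candidate boxes per image, regardless of content; 15 here is most likely the detection model's object-query count. To get the actual detections, the queries have to be filtered by their class scores from outputs['pred_logits'], dropping the trailing "no object" class. A minimal sketch with NumPy, using toy arrays in place of real model outputs (the 0.9 threshold and the class layout are assumptions):

```python
import numpy as np

def filter_detections(pred_logits, pred_boxes, threshold=0.9):
    """Keep only queries whose best real-class score clears the threshold.

    pred_logits: (num_queries, num_classes + 1) raw scores; last column is "no object".
    pred_boxes:  (num_queries, 4) candidate boxes.
    """
    # Softmax over the class dimension to turn logits into probabilities.
    e = np.exp(pred_logits - pred_logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # Best real class per query, excluding the final "no object" column.
    scores = probs[:, :-1].max(axis=-1)
    labels = probs[:, :-1].argmax(axis=-1)
    keep = scores > threshold
    return pred_boxes[keep], labels[keep], scores[keep]

# Toy example: 3 queries, 2 real classes plus "no object".
logits = np.array([[8.0, 0.0, 0.0],   # confidently class 0
                   [0.0, 0.0, 8.0],   # confidently "no object"
                   [0.0, 8.0, 0.0]])  # confidently class 1
boxes = np.array([[0.5, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.05, 0.05],
                  [0.4, 0.6, 0.3, 0.2]])

kept_boxes, kept_labels, kept_scores = filter_detections(logits, boxes)
print(kept_boxes.shape)  # (2, 4): the "no object" query is filtered out
```

So a constant second dimension of 15 does not mean 15 detections; after score filtering, only the confident queries remain.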

linkstatic12 commented 8 months ago

https://colab.research.google.com/drive/1lLRyBr7WraGdUJm-urUm_utArw6SkoCJ?usp=sharing

Take a look at this

Dipankar1997161 commented 8 months ago

Table Transformer doesn't perform OCR directly; the docs do mention passing a words_dir, which is the output of whatever OCR engine you used.

So the CSV files, I think, should carry the bounding boxes and other attributes, or they might carry the OCR text (the words) found within the ROI when the tables are saved.
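To expand on that: per INFERENCE.md, the recognized text is supplied separately as word tokens, each a dict carrying at least a "bbox" and "text". The extra span/line/block fields below are an assumption about how the pipeline orders words into reading order, and the file name is hypothetical; check the repo's sample words files for the exact schema. A minimal sketch that writes such a words file from any OCR output:

```python
import json

# Hypothetical OCR output for one table image: each token carries its
# bounding box in image pixel coordinates plus its recognized text.
tokens = [
    {"bbox": [52.0, 10.0, 130.0, 28.0], "text": "Revenue",
     "span_num": 0, "line_num": 0, "block_num": 0},
    {"bbox": [180.0, 10.0, 240.0, 28.0], "text": "2022",
     "span_num": 1, "line_num": 0, "block_num": 0},
]

# inference.py reads these from the directory passed as words_dir; the
# filename is assumed to match the image name with a "_words.json" suffix.
with open("sample_table_words.json", "w") as f:
    json.dump(tokens, f)
```

With a words file like this alongside each image, the cells_to_csv and cells_to_html outputs have actual text to fill the cells with; without it, there is nothing but geometry.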

SuleyNL commented 7 months ago

You can use Extractable to extract CSV or XML from PDF tables. It is built on top of Microsoft's TATR and is compatible with Windows, Ubuntu, and macOS.

linkstatic12 commented 7 months ago

I ran Extractable thinking it would be a good alternative to my implementation, but sadly it's not that good.

SuleyNL commented 7 months ago

@linkstatic12 what seemed to be the issue?

sharvaridhote commented 7 months ago

@SuleyNL, Extractable is good and it does work for getting table images. However, the CSV or XML output contains bounding box values rather than the actual table values and labels.

SuleyNL commented 7 months ago

@sharvaridhote Could you please share the code you ran that produces the problem? I cannot seem to recreate it using this:

import extractable as ex
table_pdf_file = 'WNT1.pdf'
empty_folder = 'WNT1xml'
ex.extract(table_pdf_file, empty_folder, output_filetype=ex.Filetype.XML, mode=ex.Mode.PERFORMANCE)

Have you run it in PRESENTATION mode? Perhaps it didn't recognize your table columns and rows. You can do this by adding mode=ex.Mode.PRESENTATION to the inputs of the extract() function.

sharvaridhote commented 7 months ago

Hi @SuleyNL, thank you for the reply. I am trying to extract tables from this: https://www.imf.org/external/pubs/ft/ar/2022/downloads/2022-financial-statements.pdf. I have tried different modes and both the XML and CSV formats. I was expecting better-formatted table output. It still does a great job of locating tables, and I can use the output table images. It would be good to get features such as which pages contain tables, how many tables there are, and each table's location, bounding box, and page number as output. Thanks!

SuleyNL commented 7 months ago

Here are some images of tables in which it detected structure: table_2, Figure_2, Figure_3

And here are some Excel tables I managed to extract from it: table_39, table_63

Unfortunately it is not perfect, and in some cases it still requires some post-processing by the programmer, as in this case: table_56

If you want to access a list of all tables you can do so by accessing the return value from the extract() function:

dataobj = ex.extract(table_pdf_file, empty_folder, output_filetype=ex.Filetype.EXCEL, mode=ex.Mode.PERFORMANCE)
list_of_table_coords_and_page = dataobj.data['table_locations']

print(list_of_table_coords_and_page)

This will output a list of dictionaries containing the x and y coordinates as well as the page number for each detected table:

[{'x': 183, 'y': 280, 'page': 3}, {'x': 132, 'y': 395, 'page': 10}, {'x': 134, 'y': 375, 'page': 11}, {'x': 139, 'y': 432, 'page': 12}, {'x': 136, 'y': 463, 'page': 13}, {'x': 134, 'y': 1141, 'page': 16}, {'x': 881, 'y': 266, 'page': 16}, {'x': 153, 'y': 1783, 'page': 17}, {'x': 878, 'y': 706, 'page': 23}, {'x': 878, 'y': 1168, 'page': 23}, {'x': 886, 'y': 1502, 'page': 24}, {'x': 878, 'y': 834, 'page': 25}, {'x': 873, 'y': 424, 'page': 25}, {'x': 878, 'y': 845, 'page': 27}, {'x': 136, 'y': 371, 'page': 27}, {'x': 874, 'y': 348, 'page': 27}, {'x': 884, 'y': 301, 'page': 28}, {'x': 129, 'y': 939, 'page': 29}, {'x': 888, 'y': 1804, 'page': 29}, {'x': 886, 'y': 488, 'page': 30}, {'x': 890, 'y': 1182, 'page': 30}, {'x': 139, 'y': 546, 'page': 31}, {'x': 891, 'y': 428, 'page': 31}, {'x': 138, 'y': 854, 'page': 32}, {'x': 879, 'y': 796, 'page': 32}, {'x': 864, 'y': 389, 'page': 33}, {'x': 887, 'y': 1196, 'page': 34}, {'x': 875, 'y': 1454, 'page': 35}, {'x': 144, 'y': 1699, 'page': 35}, {'x': 137, 'y': 284, 'page': 35}, {'x': 142, 'y': 1535, 'page': 36}, {'x': 146, 'y': 1744, 'page': 38}, {'x': 884, 'y': 909, 'page': 39}, {'x': 871, 'y': 382, 'page': 39}, {'x': 134, 'y': 1685, 'page': 40}, {'x': 141, 'y': 868, 'page': 41}, {'x': 149, 'y': 399, 'page': 44}, {'x': 149, 'y': 356, 'page': 45}, {'x': 149, 'y': 370, 'page': 46}, {'x': 147, 'y': 378, 'page': 47}, {'x': 140, 'y': 410, 'page': 48}, {'x': 143, 'y': 395, 'page': 50}, {'x': 134, 'y': 402, 'page': 51}, {'x': 305, 'y': 1950, 'page': 54}
...
...
...}]

If you would like to see the confidence as well, that is also possible, since it is logged by Extractable, but it is a bit trickier to get. The confidence for each table is found in dataobj.data['TableDetectorTATR']['detection']:

dataobj = ex.extract(table_pdf_file, empty_folder, output_filetype=ex.Filetype.EXCEL, mode=ex.Mode.PERFORMANCE)
list_of_logs_containing_confidence = dataobj.data['TableDetectorTATR']['detection']

print(list_of_logs_containing_confidence)

This would return:

['Detected table with confidence: 0.942 at location: [223.73, 320.86, 1570.85, 524.63]', 
'Detected table with confidence: 0.998 at location: [172.82, 435.92, 1550.88, 1273.52]', 
'Detected table with confidence: 0.999 at location: [174.49, 415.85, 1524.34, 1249.1]', 
'Detected table with confidence: 0.997 at location: [179.45, 472.47, 1528.12, 750.77]', 
'Detected table with confidence: 0.999 at location: [176.47, 503.17, 1544.16, 1798.65]', 
'Detected table with confidence: 0.999 at location: [174.29, 1181.46, 814.84, 1379.8]', 
'Detected table with confidence: 0.978 at location: [921.51, 306.78, 1423.8, 489.91]', 
'Detected table with confidence: 0.919 at location: [193.14, 1823.28, 871.23, 2056.43]', 
'Detected table with confidence: 0.903 at location: [918.15, 746.17, 1574.14, 965.21]', 
'Detected table with confidence: 1.0 at location: [918.04, 1208.17, 1575.54, 1364.85]', 
'Detected table with confidence: 0.976 at location: [926.71, 1542.82, 1515.95, 1871.22]',
...
...
... ]

These are the x, y, width, and height values as provided by TATR; they are fitted very tightly to the table. I recommend expanding all of the borders by 20px to get an adequate image of the full table.
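Since those confidences come back as log strings, pulling the numbers out takes a little parsing. A minimal sketch that extracts the confidence and box from one of these strings and applies the 20px expansion described above; it treats the four values as x, y, width, height per the description (if they turn out to be corner coordinates instead, pad all four edges directly):

```python
import re

LOG_PATTERN = re.compile(
    r"Detected table with confidence: ([\d.]+) at location: \[([^\]]+)\]")

def parse_detection(log_line):
    """Extract (confidence, [x, y, w, h]) from one Extractable log string."""
    m = LOG_PATTERN.match(log_line)
    confidence = float(m.group(1))
    box = [float(v) for v in m.group(2).split(",")]
    return confidence, box

def expand_box(box, pad=20):
    """Grow the box by `pad` pixels on every side, assuming x, y, w, h layout."""
    x, y, w, h = box
    return [x - pad, y - pad, w + 2 * pad, h + 2 * pad]

conf, box = parse_detection(
    "Detected table with confidence: 0.942 at location: [223.73, 320.86, 1570.85, 524.63]")
print(conf, expand_box(box))
```

This gives you numeric confidences and padded crop boxes without waiting for a structured API for these values.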

If the column and row detection is not to your liking, you can also tune the sensitivity of the model by manually editing the extractable/StructureDetector.py file, lines 166 to 178. Here, Extractable decides which columns and rows to keep and which to 'throw away' based on TATR's confidence.

# Keep columns (label 1) only when confidence is above 88%,
# and rows (label 2) only when confidence is above 64%:
if not (label == 1 and score <= .88) and \
   not (label == 2 and score <= .64) and \

I understand that this is a temporary solution; I am working on allowing these values to be passed as input variables to the extract() function so you don't need to dig into the code to change them.

I hope you got some value out of it. If you have any feedback or suggestions regarding extractable, let me know!

mirix commented 7 months ago

I have provided the tokens as a list of dictionaries as suggested here:

https://github.com/microsoft/table-transformer/blob/main/docs/INFERENCE.md

However, it seems to be completely ignored.

I guess the bounding boxes differ a bit, but still...

omid-ghozatlou commented 2 weeks ago

Hello everyone! I have some unstructured tables that confuse the LLM when it tries to extract the correct values from them. I was thinking of using this code to extract the tables in CSV format and then feed that to the LLM. However, @linkstatic12 said this code does not perform OCR. What do you recommend for extracting the tables into correct CSV format?