Kehindeajayi01 opened this issue 9 months ago
I might be wrong here but the table transformer doesn't do OCR.
Have you checked the inference.py file provided? Specifically, the cells_to_csv and cells_to_html functions.
Yes, neither function performs OCR.
A related question: how do I get the bounding boxes for all detected cells? They are in the return value of self.det_model(), but in my examples outputs['pred_boxes'].shape is always [1, 15, 4]. Why always 15?
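A likely explanation (a sketch, not an official answer from the TATR authors): DETR-style detectors always emit a fixed number of query slots regardless of how many objects are present, so pred_boxes has shape [batch, num_queries, 4] every time; the real detections are selected afterwards by thresholding the class scores in pred_logits. A minimal NumPy sketch of that post-processing, with made-up tensor shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def filter_queries(pred_logits, pred_boxes, threshold=0.5):
    """Keep only query slots whose best non-background class beats the threshold.

    pred_logits: [num_queries, num_classes + 1] (last class is "no object")
    pred_boxes:  [num_queries, 4]
    """
    probs = softmax(pred_logits)          # per-query class distribution
    scores = probs[:, :-1].max(axis=-1)   # best real-class score per query
    labels = probs[:, :-1].argmax(axis=-1)
    keep = scores > threshold
    return pred_boxes[keep], labels[keep], scores[keep]
```

The query count (15 here) comes from the model config, not from the image content, so a fixed second dimension is expected.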
Table Transformer doesn't perform OCR directly; the docs mention passing a words_dir containing the output of whatever OCR engine you used.
So the CSV files, I think, should carry the bounding boxes and related metadata, or possibly the OCR words found within the ROI when the tables are saved.
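For reference, the words_dir files described in INFERENCE.md are per-image JSON lists of token dictionaries with at least a 'bbox' ([xmin, ymin, xmax, ymax] in image pixels) and 'text'. A sketch of converting word-level OCR output (pytesseract's image_to_data-style parallel lists) into that shape; the field names beyond 'bbox' and 'text' are assumptions, so check INFERENCE.md against your version:

```python
import json

def ocr_to_tokens(texts, lefts, tops, widths, heights):
    """Convert word-level OCR output into TATR-style token dicts."""
    tokens = []
    for i, text in enumerate(texts):
        if not text.strip():
            continue  # skip empty OCR hits
        x, y, w, h = lefts[i], tops[i], widths[i], heights[i]
        tokens.append({
            "bbox": [x, y, x + w, y + h],
            "text": text,
            "span_num": i,   # assumed field
            "line_num": 0,   # assumed field
            "block_num": 0,  # assumed field
        })
    return tokens

# Then write one <image_name>_words.json per table image into words_dir,
# e.g. json.dump(tokens, open("words_dir/table_0_words.json", "w"))
```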
You can use Extractable to extract CSV or XML from PDF tables. It is built on top of Microsoft's TATR and is compatible with Windows, Ubuntu, and macOS.
I ran Extractable thinking it would be a good alternative to my implementation, but sadly it's not that good.
@linkstatic12 what seemed to be the issue?
@SuleyNL, Extractable is good and it does work for getting table images. However, the CSV or XML output returns bounding-box values rather than the actual table values and labels.
@sharvaridhote Could you please share the code you ran that produces the problem? I cannot seem to recreate it using this:
```python
import extractable as ex

table_pdf_file = 'WNT1.pdf'
empty_folder = 'WNT1xml'

ex.extract(table_pdf_file, empty_folder, output_filetype=ex.Filetype.XML, mode=ex.Mode.PERFORMANCE)
```
Have you run it in PRESENTATION mode? Perhaps it didn't recognize your table columns and rows. You can try this by adding mode=ex.Mode.PRESENTATION to the arguments of the extract() function.
Hi @SuleyNL, thank you for the reply. I am trying to extract tables from this: https://www.imf.org/external/pubs/ft/ar/2022/downloads/2022-financial-statements.pdf. I have tried different modes and both XML and CSV output. I was expecting better-formatted table output. It still does a great job of locating tables, and I can use the output table images. It would be good to get features such as how many tables there are and, for each one, its page number, location, and bounding box. Thanks
Here are some images of tables in which it detected structure:
And here are some Excel tables I managed to extract from them:
Unfortunately it is not perfect, and in some cases it still requires some post-processing by the programmer, such as in this case:
If you want to access a list of all tables you can do so by accessing the return value from the extract() function:
```python
dataobj = ex.extract(table_pdf_file, empty_folder, output_filetype=ex.Filetype.EXCEL, mode=ex.Mode.PERFORMANCE)
list_of_table_coords_and_page = dataobj.data['table_locations']
print(list_of_table_coords_and_page)
```
This will output a list of dictionaries containing the x and y coordinates, as well as the page number, for each detected table:
```
[{'x': 183, 'y': 280, 'page': 3}, {'x': 132, 'y': 395, 'page': 10}, {'x': 134, 'y': 375, 'page': 11}, {'x': 139, 'y': 432, 'page': 12}, {'x': 136, 'y': 463, 'page': 13}, {'x': 134, 'y': 1141, 'page': 16}, {'x': 881, 'y': 266, 'page': 16}, {'x': 153, 'y': 1783, 'page': 17}, {'x': 878, 'y': 706, 'page': 23}, {'x': 878, 'y': 1168, 'page': 23}, {'x': 886, 'y': 1502, 'page': 24}, {'x': 878, 'y': 834, 'page': 25}, {'x': 873, 'y': 424, 'page': 25}, {'x': 878, 'y': 845, 'page': 27}, {'x': 136, 'y': 371, 'page': 27}, {'x': 874, 'y': 348, 'page': 27}, {'x': 884, 'y': 301, 'page': 28}, {'x': 129, 'y': 939, 'page': 29}, {'x': 888, 'y': 1804, 'page': 29}, {'x': 886, 'y': 488, 'page': 30}, {'x': 890, 'y': 1182, 'page': 30}, {'x': 139, 'y': 546, 'page': 31}, {'x': 891, 'y': 428, 'page': 31}, {'x': 138, 'y': 854, 'page': 32}, {'x': 879, 'y': 796, 'page': 32}, {'x': 864, 'y': 389, 'page': 33}, {'x': 887, 'y': 1196, 'page': 34}, {'x': 875, 'y': 1454, 'page': 35}, {'x': 144, 'y': 1699, 'page': 35}, {'x': 137, 'y': 284, 'page': 35}, {'x': 142, 'y': 1535, 'page': 36}, {'x': 146, 'y': 1744, 'page': 38}, {'x': 884, 'y': 909, 'page': 39}, {'x': 871, 'y': 382, 'page': 39}, {'x': 134, 'y': 1685, 'page': 40}, {'x': 141, 'y': 868, 'page': 41}, {'x': 149, 'y': 399, 'page': 44}, {'x': 149, 'y': 356, 'page': 45}, {'x': 149, 'y': 370, 'page': 46}, {'x': 147, 'y': 378, 'page': 47}, {'x': 140, 'y': 410, 'page': 48}, {'x': 143, 'y': 395, 'page': 50}, {'x': 134, 'y': 402, 'page': 51}, {'x': 305, 'y': 1950, 'page': 54}
...
...
...}]
```
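Getting "how many tables and on which pages" out of that list takes only a few lines of standard Python over the returned dictionaries (the sample entries below are copied from the output above):

```python
from collections import Counter

table_locations = [
    {'x': 183, 'y': 280, 'page': 3},
    {'x': 132, 'y': 395, 'page': 10},
    {'x': 134, 'y': 1141, 'page': 16},
    {'x': 881, 'y': 266, 'page': 16},
]

# Count how many tables were detected on each page.
tables_per_page = Counter(t['page'] for t in table_locations)

print(len(table_locations))  # total number of tables -> 4
print(tables_per_page[16])   # tables on page 16 -> 2
```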
If you would like to see the confidence as well, that is also possible, since it is logged by Extractable, but it is a bit trickier to get. The confidence for each table is found in dataobj.data['TableDetectorTATR']['detection']:
```python
dataobj = ex.extract(table_pdf_file, empty_folder, output_filetype=ex.Filetype.EXCEL, mode=ex.Mode.PERFORMANCE)
list_of_logs_containing_confidence = dataobj.data['TableDetectorTATR']['detection']
print(list_of_logs_containing_confidence)
```
This would return:
```
['Detected table with confidence: 0.942 at location: [223.73, 320.86, 1570.85, 524.63]',
 'Detected table with confidence: 0.998 at location: [172.82, 435.92, 1550.88, 1273.52]',
 'Detected table with confidence: 0.999 at location: [174.49, 415.85, 1524.34, 1249.1]',
 'Detected table with confidence: 0.997 at location: [179.45, 472.47, 1528.12, 750.77]',
 'Detected table with confidence: 0.999 at location: [176.47, 503.17, 1544.16, 1798.65]',
 'Detected table with confidence: 0.999 at location: [174.29, 1181.46, 814.84, 1379.8]',
 'Detected table with confidence: 0.978 at location: [921.51, 306.78, 1423.8, 489.91]',
 'Detected table with confidence: 0.919 at location: [193.14, 1823.28, 871.23, 2056.43]',
 'Detected table with confidence: 0.903 at location: [918.15, 746.17, 1574.14, 965.21]',
 'Detected table with confidence: 1.0 at location: [918.04, 1208.17, 1575.54, 1364.85]',
 'Detected table with confidence: 0.976 at location: [926.71, 1542.82, 1515.95, 1871.22]',
 ...
 ...
 ... ]
```
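Since the confidences are stored as log strings, you can pull the numbers back out with a regular expression (the pattern below is written against the log format shown above):

```python
import re

LOG_RE = re.compile(
    r"confidence: ([\d.]+) at location: \[([\d.]+), ([\d.]+), ([\d.]+), ([\d.]+)\]"
)

def parse_detection(log_line):
    """Return (confidence, [four coordinate values]) from one Extractable log string."""
    m = LOG_RE.search(log_line)
    if m is None:
        raise ValueError(f"unrecognised log line: {log_line!r}")
    conf, *coords = (float(g) for g in m.groups())
    return conf, list(coords)

conf, box = parse_detection(
    'Detected table with confidence: 0.942 at location: [223.73, 320.86, 1570.85, 524.63]'
)
```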
These are the x, y, width and height values as provided by TATR; they are very tightly fitted to the table. I recommend expanding all of the borders by 20px to get an adequate image of the full table.
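Expanding the borders can be done like this (a sketch: the coordinate order follows the comment above, and the default page dimensions are placeholder assumptions you should replace with your rendered page size):

```python
def pad_box(x, y, w, h, pad=20, page_w=1654, page_h=2339):
    """Grow a tight (x, y, width, height) box by `pad` pixels on every side,
    clamped to the page bounds so the crop never leaves the image."""
    x0 = max(0, x - pad)
    y0 = max(0, y - pad)
    x1 = min(page_w, x + w + pad)
    y1 = min(page_h, y + h + pad)
    return x0, y0, x1 - x0, y1 - y0
```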
If the column and row detection is not to your liking, you can also tune the sensitivity of the model by manually editing the extractable/StructureDetector.py file at lines 166 to 178. Here, Extractable decides which columns and rows to keep and which to 'throw away' based on TATR's confidence:
```python
# Keep columns (label == 1) only when confidence is higher than 88%,
# and rows (label == 2) only when confidence is higher than 64%.
if not (label == 1 and score <= .88) and \
   not (label == 2 and score <= .64):
```
I understand that this is a temporary solution; I am working on allowing these values to be passed as input variables to the extract() function so you don't need to go into the code to change them.
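Until then, the keep/throw-away rule amounts to per-label confidence thresholds, which you could also apply yourself after the fact. A self-contained sketch (the label-to-threshold mapping mirrors the snippet above; the (label, score, box) tuple shape is an assumption for illustration):

```python
# Per-label minimum confidences: 1 = column, 2 = row (assumed mapping).
THRESHOLDS = {1: 0.88, 2: 0.64}

def keep_detections(detections):
    """Filter (label, score, box) tuples, keeping those above their
    label-specific threshold; labels without a threshold are always kept."""
    return [
        (label, score, box)
        for label, score, box in detections
        if score > THRESHOLDS.get(label, 0.0)
    ]
```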
I hope you got some value out of it. If you have any feedback or suggestions regarding Extractable, let me know!
I have provided the tokens as a list of dictionaries as suggested here:
https://github.com/microsoft/table-transformer/blob/main/docs/INFERENCE.md
However, the tokens seem to be completely ignored.
I guess the bounding boxes differ a bit, but still...
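One common cause worth ruling out (an assumption, not a confirmed diagnosis of this case): the token bboxes must be in the same coordinate system as the table image. If the OCR ran on the PDF itself (points, 72 per inch) while the image was rendered at a higher DPI, every token box will miss its cell. Scaling them is straightforward:

```python
def pdf_points_to_pixels(bbox, dpi=144):
    """Scale an [xmin, ymin, xmax, ymax] bbox from PDF points (72 per inch)
    to image pixels at the given rendering DPI."""
    scale = dpi / 72.0
    return [coord * scale for coord in bbox]
```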
Hello everyone! I have some unstructured tables that confuse the LLM when it tries to extract their values. I was thinking of using this code to extract the tables in CSV format and then feeding that to the LLM. However, @linkstatic12 said this code does not do OCR. What do you recommend for extracting tables into correct CSV format?
Hi, thanks for the great work! I used the inference.py file on sample table images with the goal of obtaining the extracted cells in either CSV or HTML format, but none was generated. What am I doing wrong?