Issue with overlapping columns and rows.

microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

MIT License


Issue with overlapping columns and rows. #122

Open Prabhav55 opened 1 year ago

Prabhav55 commented 1 year ago

Hi,

I have been using Table Transformer for a project related to extraction, and I have a few questions about the pre- and post-processing of outputs:

Currently I am using the DETR feature extractor as the post-processing tool for the output. While the accuracy is good at a threshold of around 60%, I am observing a lot of overlap between the columns. For example, if three columns are present, the output includes five columns that overlap one another. Reducing the threshold decreases the output quality for other inputs. Sample images are attached.

Is there a way to increase padding for columns in post-processing?

Thanks for the help! Happy to provide any other information necessary.
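One way to approach the overlap described above purely in post-processing is to suppress lower-scoring column boxes that mostly coincide with a higher-scoring one. The following is a minimal sketch of that idea, not the repository's own post-processing. The function names, the (x_min, y_min, x_max, y_max) box layout, the 0.5 overlap limit, and the sample boxes and scores are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def horizontal_overlap(a: Box, b: Box) -> float:
    """Fraction of the narrower box's width that the other box covers."""
    inter = min(a[2], b[2]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    narrower = min(a[2] - a[0], b[2] - b[0])
    return inter / narrower if narrower > 0 else 0.0


def suppress_overlapping_columns(
    boxes: List[Box], scores: List[float], max_overlap: float = 0.5
) -> List[Tuple[Box, float]]:
    """Greedy suppression: keep the highest-scoring columns first and drop
    any column that overlaps an already-kept one by more than max_overlap."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(horizontal_overlap(boxes[i], boxes[j]) <= max_overlap for j in kept):
            kept.append(i)
    kept.sort(key=lambda i: boxes[i][0])  # return columns left to right
    return [(boxes[i], scores[i]) for i in kept]


# Hypothetical detections: five column boxes predicted for a three-column table.
boxes = [
    (0, 0, 100, 300),
    (90, 0, 210, 300),
    (40, 0, 150, 300),
    (200, 0, 300, 300),
    (180, 0, 290, 300),
]
scores = [0.95, 0.90, 0.65, 0.92, 0.70]

columns = suppress_overlapping_columns(boxes, scores)
for box, score in columns:
    print(box, score)
```

Measuring overlap against the narrower box, rather than using intersection over union, is a deliberate choice here: a thin spurious column nested inside a wide one is then still suppressed. The limit would need tuning per document type, and this kind of filtering does not replace fine-tuning on in-domain data.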

bsmock commented 1 year ago

Hi,

Are you using the model trained only on PubTables-1M? I can see why that model would be confused: it hasn't seen very many tables (if any) where a dollar sign is that far to the left within the column. Have you tried training TATR with FinTabNet.c? We have a script to process the FinTabNet dataset into a dataset called FinTabNet.c that can be used to train TATR. That should help a lot. We have already trained a model jointly on PubTables-1M and FinTabNet.c but we still need to get approval to release the weights.

Cheers, Brandon

Prabhav55 commented 1 year ago

Hi,

Thanks for the quick help. I was looking for a way to improve performance with post-processing (due to memory constraints for training), but I think you are right about fine-tuning. Just a side question: is the DETR feature extractor the recommended post-processor for table-transformer? HuggingFace also has an AutoImageProcessor.

Thanks, Prabhav

linkstatic12 commented 1 year ago

@bsmock Would I need to modify detection_config.json and structure_config.json when I train TATR with the FinTabNet dataset?

linkstatic12 commented 1 year ago

I have found that EasyOCR is much better than Tesseract when it comes to OCR on PDFs with tables and financial data. I am also trying to use TrOCR with TATR to resolve the issue I am working on. Do sites like docsumo and extracttables use TATR or CascadeTabNet? In your opinion, which is better: CascadeTabNet or TATR?

docsumo: docsumo.com
extracttables: https://extracttable.com/
CascadeTabNet: https://github.com/DevashishPrasad/CascadeTabNet/tree/master

linkstatic12 commented 1 year ago

You will get an error while running process_fintabnet.py. To fix it, modify the code at line 1340.

From this:

    with open(save_filepath, 'w') as out_file:

To this:

    with open(save_filepath, 'w', encoding="utf-8") as out_file:
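
The comment does not say which error occurs. A plausible cause, stated here as an assumption, is a UnicodeEncodeError raised when open() falls back to a platform default encoding that cannot represent some character in the data. The sketch below illustrates why passing encoding="utf-8" avoids that. The helper name write_text, the sample string, and the use of "ascii" as a stand-in for a restrictive default encoding are all hypothetical.

```python
import os
import tempfile


def write_text(text, save_filepath, encoding):
    """Write text with an explicit encoding instead of the platform default."""
    with open(save_filepath, "w", encoding=encoding) as out_file:
        out_file.write(text)


# Hypothetical cell content containing non-ASCII characters.
sample = "Revenue: 1,200 € (2019–2020)"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cell.txt")

    # "ascii" stands in for a default encoding that lacks these characters.
    try:
        write_text(sample, path, encoding="ascii")
        failed_with_ascii = False
    except UnicodeEncodeError:
        failed_with_ascii = True

    # With UTF-8 the write succeeds and the text round-trips unchanged.
    write_text(sample, path, encoding="utf-8")
    with open(path, "r", encoding="utf-8") as in_file:
        round_trip = in_file.read()

print(failed_with_ascii, round_trip == sample)
```

Any code that later reads the generated files should then open them with encoding="utf-8" as well, so the result does not depend on the platform default.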