microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License

[Question] Row Detection is Bad #110

Open Saeed11b95 opened 1 year ago

Saeed11b95 commented 1 year ago

Hi @bsmock
Thank you so much for your awesome work. I am fine-tuning the TSR transformer for financial data, and I have used scripts from the repo to convert FinTabNet to the PubTables-1M dataset format. However, despite multiple fine-tuning runs, my row detection isn't improving: it comes out pretty good at the top and bottom of the tables but pretty bad towards the center. I have tried NMS to remove the false positives, but it doesn't help. I have also tried visualizing the backbone layers to understand the behavior, but it didn't make any sense.

[Attached example images: 06 Microsoft pdf0, 16 PBB 2021 pdf0, 27 Verizon pdf0, and ![18 DKSH pdf0](https://user-images.githubusercontent.com/91561697/233794674-5e1d561e-358d-490a-a27f-dc29d68439e8.png)]

Please help me understand the reason for this behavior and how to fix it. These test images are not from FinTabNet, but they are similar-style tables. I have also tried fine-tuning with only the big tables in FinTabNet, since these test images are all big tables, but nothing seems to improve the row detection. Thanks and Regards

bsmock commented 1 year ago

Yes, my best guess is this happens because there are not enough object queries in the model to handle tables above a certain size. DETR is known to need a certain number of extra object queries beyond the number of objects to be recognized. Our pre-trained model uses 125 object queries.

Assuming you just want a better model and do not want to engineer solutions around the current one, you probably need to increase the number of object queries to 175 or more. You can easily transfer the pre-trained backbone and encoder to the new model, but you might need to write custom code to copy weights if you want to preserve the first 125 learned query embeddings in the decoder.
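In case it's useful, here is a minimal, untested sketch of that weight transfer. It assumes a DETR-style state dict where only `query_embed.weight` depends on the number of queries; `build_model` and the checkpoint path are placeholders for however you construct and load the model:

```python
import torch

# Load the pre-trained 125-query checkpoint (path is a placeholder).
old_state = torch.load("tatr_checkpoint.pth", map_location="cpu")

# Build a fresh model with more object queries (constructor is assumed).
new_model = build_model(num_queries=175)
new_state = new_model.state_dict()

for name, tensor in old_state.items():
    if name == "query_embed.weight":
        # Keep the 125 learned query embeddings; the extra 50 rows stay
        # randomly initialized and get learned during fine-tuning.
        new_state[name][: tensor.shape[0]].copy_(tensor)
    elif name in new_state and new_state[name].shape == tensor.shape:
        # Backbone, encoder, decoder, and prediction heads copy over as-is.
        new_state[name].copy_(tensor)

new_model.load_state_dict(new_state)
```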

Hope that helps!

Best, Brandon

Saeed11b95 commented 1 year ago

Thanks for your response. I don't think object queries are the problem, because row detection is also bad on images with fewer than 50 objects per image.

WalidHadri-Iron commented 1 year ago

@Saeed11b95 I am having the same issue with long tables (when the height dominates). One workaround I have found that fixes some of them is stretching the table vertically, i.e. resizing the image with a larger scale factor for the height than for the width (see the sketch after my PS below).

PS: I have not fine-tuned the model on FinTabNet; I am using the weights available here (trained on PubTables-1M).
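A rough sketch of that vertical stretch, assuming PIL; the 1.5 height factor is illustrative only, not a recommendation:

```python
from PIL import Image

def stretch_table(image: Image.Image, h_scale: float = 1.5) -> Image.Image:
    """Resize with a larger scale factor for the height than for the width."""
    return image.resize((image.width, round(image.height * h_scale)), Image.BILINEAR)
```

Any row boxes predicted on the stretched image then need their y-coordinates divided by `h_scale` to map back to the original image.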

WalidHadri-Iron commented 1 year ago

@bsmock Any insights on this? In particular, the model is highly sensitive to how the image is resized: the idea I mentioned above recovers the correct rows for some tables, but there is no obvious way to choose one common resizing ratio that works for all of them.

@Saeed11b95 Did you manage to solve this?

Saeed11b95 commented 1 year ago

@WalidHadri-Iron Hi. Actually no. I tried splitting the long tables into smaller tables and then combining the predictions on the splits, but it's not a robust solution.
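For reference, a minimal sketch of that split-and-merge idea; `predict_rows` is a stand-in for whatever inference wrapper you use (assumed to return row boxes as (x1, y1, x2, y2) tuples), and the strip height and overlap values are illustrative:

```python
from PIL import Image

def rows_from_tall_table(image, predict_rows, strip_height=800, overlap=100):
    """Run TSR on overlapping horizontal strips and merge the row boxes."""
    rows = []
    top = 0
    while top < image.height:
        bottom = min(top + strip_height, image.height)
        strip = image.crop((0, top, image.width, bottom))
        # Shift each predicted box back into full-image coordinates.
        for (x1, y1, x2, y2) in predict_rows(strip):
            rows.append((x1, y1 + top, x2, y2 + top))
        if bottom == image.height:
            break
        # Overlap the strips so a row cut at a boundary appears whole in one of them.
        top = bottom - overlap
    # Rows detected twice in the overlap region still need de-duplication (e.g. NMS).
    return rows
```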