microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

Table annotations within full page images #136


alfassy commented 10 months ago

Hi! I would like to train a model with your data, but with the Structure annotations (rows, columns, etc.) placed within the full-page images found in the Detection data. Going over your data, I couldn't find any mapping between the annotations in the Detection and Structure datasets. Do you have such a mapping, perhaps in the script that creates the annotations? I would really appreciate the help, and we could publish the result together afterwards. Thank you, Amit

bsmock commented 10 months ago

Hi,

In theory, the structure annotations on top of the full-page table detection images should be recoverable from the PDF-Annotations data.
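For intuition, here is a minimal sketch of the coordinate transform that recovery involves, assuming the page images were rendered from the PDF at a known scale; the function name and the bottom-left PDF origin convention are assumptions here, not something taken from the dataset-creation code:

```python
# Sketch: map a bounding box given in PDF coordinates (origin at the
# bottom-left, units in points) into pixel coordinates of a rendered
# full-page image (origin at the top-left). The render scale is assumed
# to be known for the page.

def pdf_bbox_to_image_bbox(bbox, pdf_page_height, scale):
    """bbox: (x1, y1, x2, y2) in PDF points; returns pixel coordinates."""
    x1, y1, x2, y2 = bbox
    # Flip the y-axis: the box's top edge in PDF coordinates (y2) becomes
    # the smaller y value in image coordinates.
    return (
        x1 * scale,
        (pdf_page_height - y2) * scale,
        x2 * scale,
        (pdf_page_height - y1) * scale,
    )
```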

However, note that for PubTables-1M, a small percentage of tables in the current Structure dataset cannot be included in a full-page structure dataset.

The reason is that a full-page image appears in the Detection dataset only if every table on that page was recognized during dataset creation. Sometimes the dataset creation script recognized at least one table on a page but not all of them. In that case, each successfully recognized table is still included in the Structure dataset, but the full page is excluded from the Detection dataset, since it would be only partially annotated.
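To illustrate that inclusion rule (the names here are hypothetical, not the actual dataset-creation code):

```python
# Sketch of the inclusion rule: every recognized table can enter the
# Structure dataset, but the page enters the Detection dataset only
# when *all* of its tables were recognized.

def split_page(page_tables, recognize):
    """Return (tables for Structure, whether the page goes to Detection)."""
    recognized = [t for t in page_tables if recognize(t)]
    include_page_in_detection = len(recognized) == len(page_tables)
    return recognized, include_page_in_detection
```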

I believe I should be able to write a script to create a full-page Structure dataset using the Detection data and the PDF-Annotations data. I'll give it a try and share it if it's successful.
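As a rough sketch of what such a script might do, assuming the structure annotations use PubTables-1M's PASCAL VOC XML layout and that each table's crop origin on the page can be recovered from the PDF-Annotations data (`to_page_coords` and its inputs are hypothetical):

```python
import xml.etree.ElementTree as ET

def read_voc_boxes(xml_path):
    """Read PASCAL VOC object boxes as [(label, (x1, y1, x2, y2)), ...]."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        b = obj.find("bndbox")
        coords = tuple(float(b.find(k).text)
                       for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.find("name").text, coords))
    return boxes

def to_page_coords(table_boxes, crop_origin):
    """Shift table-local boxes by the table crop's top-left corner
    to place them in full-page coordinates."""
    ox, oy = crop_origin
    return [(label, (x1 + ox, y1 + oy, x2 + ox, y2 + oy))
            for label, (x1, y1, x2, y2) in table_boxes]
```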

Best, Brandon