microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

Table annotations within full page images #136


alfassy commented 10 months ago

Hi! I would like to train a model with your data, but with the Structure annotations (rows, columns, etc.) placed within the full-page images found in the Detection data. Going over your data, I couldn't find any mapping between the annotations in the Detection and Structure datasets. Do you have such a mapping, perhaps in the script that creates the annotations? I would really appreciate the help, and we could publish the result together afterwards. Thank you, Amit

bsmock commented 10 months ago

Hi,

In theory, the structure annotations on top of the full-page table detection images should be recoverable from the PDF-Annotations data.
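For intuition, here is a minimal sketch of the coordinate transform that recovery involves, assuming the page images were rendered from the PDF at a known scale; the function name and the bottom-left PDF origin convention are assumptions here, not something taken from the dataset-creation code:

```python
# Sketch: map a bounding box given in PDF coordinates (origin at the
# bottom-left, units in points) into pixel coordinates of a rendered
# full-page image (origin at the top-left). The render scale is assumed
# to be known for the page.

def pdf_bbox_to_image_bbox(bbox, pdf_page_height, scale):
    """bbox: (x1, y1, x2, y2) in PDF points; returns pixel coordinates."""
    x1, y1, x2, y2 = bbox
    # Flip the y-axis: the box's top edge in PDF coordinates (y2) becomes
    # the smaller y value in image coordinates.
    return (
        x1 * scale,
        (pdf_page_height - y2) * scale,
        x2 * scale,
        (pdf_page_height - y1) * scale,
    )
```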

However, note that for PubTables-1M, a small percentage of tables in the current Structure dataset cannot be included in a full-page structure dataset.

The reason is that a full-page image appears in the Detection dataset only if every table on that page was recognized during dataset creation. Sometimes the dataset creation script recognized at least one table on a page but not all of them. In that case, each successfully recognized table is still included in the Structure dataset, but the full page is excluded from the Detection dataset, since it would be only partially annotated.
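To illustrate that inclusion rule (the names here are hypothetical, not the actual dataset-creation code):

```python
# Sketch of the inclusion rule: every recognized table can enter the
# Structure dataset, but the page enters the Detection dataset only
# when *all* of its tables were recognized.

def split_page(page_tables, recognize):
    """Return (tables for Structure, whether the page goes to Detection)."""
    recognized = [t for t in page_tables if recognize(t)]
    include_page_in_detection = len(recognized) == len(page_tables)
    return recognized, include_page_in_detection
```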

I believe I should be able to write a script to create a full-page Structure dataset using the Detection data and the PDF-Annotations data. I'll give it a try and share it if it's successful.
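As a rough sketch of what such a script might do, assuming the structure annotations use PubTables-1M's PASCAL VOC XML layout and that each table's crop origin on the page can be recovered from the PDF-Annotations data (`to_page_coords` and its inputs are hypothetical):

```python
import xml.etree.ElementTree as ET

def read_voc_boxes(xml_path):
    """Read PASCAL VOC object boxes as [(label, (x1, y1, x2, y2)), ...]."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        b = obj.find("bndbox")
        coords = tuple(float(b.find(k).text)
                       for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.find("name").text, coords))
    return boxes

def to_page_coords(table_boxes, crop_origin):
    """Shift table-local boxes by the table crop's top-left corner
    to place them in full-page coordinates."""
    ox, oy = crop_origin
    return [(label, (x1 + ox, y1 + oy, x2 + ox, y2 + oy))
            for label, (x1, y1, x2, y2) in table_boxes]
```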

Best, Brandon