microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License
2.01k stars 231 forks

Questions regarding the pubmed datasets. #137

Open k920049 opened 10 months ago

k920049 commented 10 months ago

Hello,

It seems there is an alternative download page for PubTables-1M on Hugging Face. Did you apply the canonicalization and consistency adjustments described in the paper "Aligning benchmark datasets for table structure recognition"? Or is it just a copy of the original dataset?

bsmock commented 10 months ago

Right now this is just a copy of the original dataset.

But soon we will update the test and val splits to version 1.1. This version is what is used in the paper "Aligning benchmark datasets for table structure recognition".

In v1.1, the cropped table images have 2 pixels of padding around the table border. In the original dataset (v1.0), these images have ~30 pixels of padding.

The training data/split is the same for v1.0 and v1.1. In other words, the training data still comes with ~30 pixels of padding around the cropped tables.
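To illustrate the padding difference, here is a minimal sketch of how one might compute a tighter crop box (2 pixels of padding) from a v1.0-style image with ~30 pixels of padding. It assumes you already have the table's bounding box in the pixel coordinates of the cropped image; the function names, the example sizes, and the coordinates are hypothetical and not part of this repository's code.

```python
def crop_box_with_padding(table_bbox, image_size, padding=2):
    """Return a (left, top, right, bottom) crop box that keeps `padding`
    pixels around the table, clamped to the image bounds."""
    xmin, ymin, xmax, ymax = table_bbox
    width, height = image_size
    left = max(0, int(xmin) - padding)
    top = max(0, int(ymin) - padding)
    right = min(width, int(xmax) + padding)
    bottom = min(height, int(ymax) + padding)
    return left, top, right, bottom


def shift_bbox(bbox, crop_box):
    """Translate a bounding box into the coordinate frame of the new crop."""
    left, top, _, _ = crop_box
    xmin, ymin, xmax, ymax = bbox
    return xmin - left, ymin - top, xmax - left, ymax - top


# Hypothetical v1.0-style crop: a 400x200 table with 30 px of padding.
image_size = (460, 260)           # (width, height), assumed values
table_bbox = (30, 30, 430, 230)   # (xmin, ymin, xmax, ymax), assumed values

box = crop_box_with_padding(table_bbox, image_size, padding=2)
new_table_bbox = shift_bbox(table_bbox, box)

print(box)             # (28, 28, 432, 232)
print(new_table_bbox)  # (2, 2, 402, 202)
```

The resulting box could then be passed to an image library's crop routine (for example `Image.crop` in Pillow), and every other annotation box would need the same shift so that it stays aligned with the re-cropped image.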

Hope that helps!

Cheers, Brandon