huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add PubTables-1M #5261

Open NielsRogge opened 2 years ago

NielsRogge commented 2 years ago

Name

PubTables-1M

Paper

https://openaccess.thecvf.com/content/CVPR2022/html/Smock_PubTables-1M_Towards_Comprehensive_Table_Extraction_From_Unstructured_Documents_CVPR_2022_paper.html

Data

https://github.com/microsoft/table-transformer

Motivation

Table Transformer is now available in 🤗 Transformers, and it was trained on PubTables-1M, a large dataset for table extraction and table structure recognition in unstructured documents.

NielsRogge commented 2 years ago

cc @albertvillanova: the dataset's author would like to add it to the Hub (see https://github.com/microsoft/table-transformer/issues/68#issuecomment-1319114621). Could you help him out?