This project contains a pipeline that takes a folder of PDF files (academic papers) and outputs CSV files of tables.
Install the requirements listed in requirements.txt by running pip install -r requirements.txt
A dataset of generated tables will be published soon. It will include the ground-truth .csv files, the original .tex files, .png images of the tables, and .png images of the table structure.
You can generate a dataset yourself using /tablegenerator/tablegen.py; see the README file in the tablegenerator folder for more information on this process.
Running the pipeline requires a pretrained model. At least two pretrained models will be made available: pix2pixHD and SegNet. The pix2pixHD model is based on NVIDIA's https://github.com/NVIDIA/pix2pixHD/. The SegNet model is based on https://github.com/GeorgeSeif/Semantic-Segmentation-Suite (an encoder-decoder with skip connections and an InceptionV4 backbone).
You can run the pipeline using python ./pipeline/batch.py. The following options are available:
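Conceptually, the batch step maps every PDF in an input folder to one CSV file per detected table. The sketch below illustrates that folder-to-folder workflow only; the names batch_convert and extract_tables are hypothetical stand-ins, not the actual interface of pipeline/batch.py, and the real extractor is backed by the pretrained model rather than a plain callable.

```python
from pathlib import Path

def batch_convert(pdf_dir, csv_dir, extract_tables):
    """Write one CSV per table found in each PDF under pdf_dir.

    extract_tables is a hypothetical stand-in for the model-backed
    extractor: any callable that takes a PDF path and returns a list
    of CSV strings, one per table.
    """
    pdf_dir, csv_dir = Path(pdf_dir), Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Name outputs after the source paper, numbering tables per PDF.
        for i, table_csv in enumerate(extract_tables(pdf)):
            out = csv_dir / f"{pdf.stem}_table{i}.csv"
            out.write_text(table_csv)
            written.append(out)
    return written
```

For a paper a.pdf containing one table, this would produce a_table0.csv in the output folder.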
Pretrained models and a small annotated test set can be found in the following Google Drive folder: https://drive.google.com/drive/u/0/folders/1dgKISbhBNfR8XXnIxUD_sIhwYNurKHbb