microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License

Annotation Tool #86

Open abhayhk2001 opened 1 year ago

abhayhk2001 commented 1 year ago

Hi, we are trying to use this model for custom training. We have a set of images we would like to fine-tune on. We were able to generate the XML files using LabelImg, but the words.json file is a little tricky. Could you please share the annotation tool you used, or suggest an alternative?

bsmock commented 1 year ago

For some context, the format and naming of the fields in the words JSON files originate from the text extraction in PyMuPDF, which for each word gives block_num, line_num, and span_num.

The current version of the Table Transformer code for incorporating text into the table extraction needs 'span_num' to give the numerical order in which words should be placed when assembling the text for each cell. 'line_num' and 'block_num' can both be set to 0 for all words, as long as 'span_num' gives the reading order.
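To make that concrete, here is a minimal sketch of building a words list for one image from OCR-style output. The input shape (a list of `(text, [xmin, ymin, xmax, ymax])` pairs), the `build_words` helper, and the `line_tolerance` parameter are assumptions for illustration, and the row-bucketing sort is only a crude stand-in for real reading order. Only the field names 'bbox', 'text', 'span_num', 'line_num', and 'block_num' come from the description above.

```python
import json


def build_words(ocr_words, line_tolerance=5.0):
    """Turn (text, [xmin, ymin, xmax, ymax]) pairs into word dicts.

    Hypothetical helper: approximates reading order by bucketing words
    into rows by their top coordinate, then sorting left to right.
    """
    ordered = sorted(
        ocr_words,
        key=lambda w: (round(w[1][1] / line_tolerance), w[1][0]),
    )
    return [
        {
            "bbox": list(bbox),
            "text": text,
            "span_num": i,   # reading order, as described above
            "line_num": 0,   # may be 0 for every word
            "block_num": 0,  # may be 0 for every word
        }
        for i, (text, bbox) in enumerate(ordered)
    ]


# Made-up example words, deliberately out of order.
example = [
    ("World", [60, 10, 100, 22]),
    ("Hello", [10, 10, 50, 22]),
    ("Next", [10, 30, 40, 42]),
]
words = build_words(example)
print(json.dumps(words, indent=2))
```

If your OCR engine already returns words in reading order, you can skip the sort and simply enumerate the list to assign 'span_num'.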

Going forward, I believe we should refactor the code to ignore these fields altogether and assume the list is already sorted in reading order. This would simplify things because then the only fields that would be needed for each word would be 'bbox' for the bounding box and 'text' for the text content.

bsmock commented 1 year ago

Check the newly created scripts/ folder for code that creates the words JSON files from PDFs, for datasets where the PDFs are available, such as PubTables-1M, FinTabNet, and SciTSR.
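For a rough idea of what PDF-based extraction can look like, here is a sketch using PyMuPDF's `page.get_text("words")`, which returns tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. This is not the code in scripts/; the helper names and the mapping of `word_no` to 'span_num' are assumptions, so check the repository scripts for the exact fields they write.

```python
def words_from_tuples(word_tuples):
    """Convert PyMuPDF-style word tuples into word dicts.

    Assumption: PyMuPDF's word_no is mapped to 'span_num' here.
    """
    words = []
    for x0, y0, x1, y1, text, block_no, line_no, word_no in word_tuples:
        words.append(
            {
                "bbox": [x0, y0, x1, y1],
                "text": text,
                "block_num": block_no,
                "line_num": line_no,
                "span_num": word_no,
            }
        )
    return words


def extract_page_words(pdf_path, page_index=0):
    """Read one page of a PDF and return its words (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so the converter works without it

    with fitz.open(pdf_path) as doc:
        return words_from_tuples(doc[page_index].get_text("words"))
```

Note that `word_no` counts words within a line rather than across the page, so if you set 'line_num' and 'block_num' to 0 you would need to renumber 'span_num' over the whole page.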