microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License
2.31k stars 256 forks

Input for TSR model? #69

Open salman-moh opened 2 years ago

salman-moh commented 2 years ago

Hi @bsmock,

From the HuggingFace colab notebook, the tables were being detected flawlessly. However, when I applied TSR to the entire PDF-page image, I got this: it tries to identify rows even in non-table regions. image And then when I passed only the cropped table image, it missed the four edges of the table. image

Am I missing something here?

Also, how would you suggest using the post-processing in postprocessing.py? Are there any particular steps you used to obtain a table in a structured format?

Many thanks in advance.

bsmock commented 2 years ago

This issue is a duplicate of #21 (and possibly others), but because the colab notebook using the models on HuggingFace is new, it's worth re-addressing.

In summary:

Each model expects (works best on) images resized to a particular maximum length. The current TD model expects images with a maximum length (the larger of width and height) of 800. The current TSR model expects images with a maximum length of 1000. This could change with future models/checkpoints.

For now, try expanding the detected bounding box by 30-40 pixels on each side before cropping the table image and before resizing the cropped image to have a maximum length of 1000 pixels.
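The two steps above (expanding the detected box, then resizing the crop so its longer side is 1000) can be sketched as plain coordinate math; the function names and the `pad` value are illustrative, not from the repo, and the PIL calls at the end are one assumed way to apply the result:

```python
def expand_bbox(bbox, pad, image_size):
    """Expand a detected table bbox (x0, y0, x1, y1) by `pad` pixels
    on each side, clamped to the image boundaries."""
    x0, y0, x1, y1 = bbox
    w, h = image_size
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(w, x1 + pad), min(h, y1 + pad))

def resize_target(size, max_length=1000):
    """Compute an output size whose longer side equals max_length,
    preserving the aspect ratio."""
    w, h = size
    scale = max_length / max(w, h)
    return (round(w * scale), round(h * scale))

# Assumed usage with PIL:
# box  = expand_bbox(table_bbox, pad=30, image_size=image.size)
# crop = image.crop(box)
# crop = crop.resize(resize_target(crop.size, max_length=1000))
```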

@NielsRogge Expanding the table bounding box or padding the image is needed for the current TSR model checkpoint, but not necessarily for other model checkpoints that may be released in the future. Probably this needs to be addressed in the model documentation and colab notebook for using the TSR model, but not added to the pre-processing code for the model. What do you think?

salman-moh commented 2 years ago

FWIW, I've added a 20-pixel white margin around the table and it detects everything perfectly. Along with this, I modified the DetrFeatureExtractor settings for TD and TSR to set the maximum length to 800 and 1000, respectively.

Sample of an out-of-domain table (a sparse table from IBM's SynthTabNet dataset): image

Any clues as to how to convert the TSR output into some sort of structured data would be appreciated! Should I use your postprocessing.py?

You might think TD is missing the last row, but the entire image is just the table, so I guess this can be expected. The model does well on the small sample I created.
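On turning TSR output into structured data: TSR predicts boxes for rows and columns, so one common approach (a minimal sketch, not the repo's postprocessing.py) is to sort the row boxes top-to-bottom and the column boxes left-to-right, then treat each row/column intersection as one cell. Text for each cell can then be pulled from OCR or PDF words falling inside that cell's bbox:

```python
def boxes_to_cells(rows, columns):
    """rows/columns: lists of (x0, y0, x1, y1) boxes predicted by TSR
    for 'table row' and 'table column' objects (spanning cells ignored
    in this sketch). Returns a 2D list of cell bboxes, one per
    row/column intersection."""
    rows = sorted(rows, key=lambda b: b[1])        # sort by top edge
    columns = sorted(columns, key=lambda b: b[0])  # sort by left edge
    grid = []
    for rx0, ry0, rx1, ry1 in rows:
        # each cell spans the column horizontally and the row vertically
        grid.append([(cx0, ry0, cx1, ry1)
                     for cx0, cy0, cx1, cy1 in columns])
    return grid
```

This ignores spanning cells, which the real post-processing has to merge into the grid separately.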

light42 commented 2 years ago

I find that the model is less accurate when the text fits tightly within the cell borders, and that can't be fixed with padding.