Input for TSR model? - Githubissues

salman-moh commented 2 years ago

Hi @bsmock,

From the HuggingFace colab notebook, the tables were being detected flawlessly. However, when I applied TSR to the entire PDF page image, I got this: it tries to identify rows even in non-table zones. And then when I tried to pass only the table image, it misses the 4 edges of the table.

Am I missing something here?

Also, how would you suggest using the post-processing from postprocessing.py? Are there any particular steps you used to obtain a structured-format table?

Many thanks in advance.

bsmock commented 2 years ago

This issue is a duplicate of #21 (and possibly others), but because the colab notebook using the models on HuggingFace is new, it's worth re-addressing.

In summary:

The TSR model could learn to work on tightly cropped table images if we trained it on these images.
The TSR model we trained in the original PubTables-1M paper was not trained to recognize tightly cropped table images. Instead it was trained on images with padding around the table, so it expects some padding around the table to be included in the image at inference time.

Each model also expects (works best on) images resized to particular maximum lengths. The current TD model expects images with a maximum length (maximum of both width and height) of 800. The current TSR model expects images with a maximum length of 1000. This could change with future models/checkpoints.

For now, try expanding the detected bounding box by 30-40 pixels on each side before cropping the table image and before resizing the cropped image to have a maximum length of 1000 pixels.
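The suggestion above (expand the detected box, crop, then resize to a maximum length of 1000) could be sketched roughly as follows. This is only an illustration, not code from the repository: the helper names `expand_bbox`, `target_size`, and `crop_table` are hypothetical, the box is assumed to be `(x0, y0, x1, y1)` in pixel coordinates, `image` is assumed to be a PIL-style image exposing `size`, `crop`, and `resize`, and "maximum length of 1000" is assumed to mean scaling the longer side to exactly 1000.

```python
def expand_bbox(bbox, pad, img_w, img_h):
    """Grow (x0, y0, x1, y1) by `pad` pixels per side, clamped to the page."""
    x0, y0, x1, y1 = bbox
    return (
        max(0, int(round(x0 - pad))),
        max(0, int(round(y0 - pad))),
        min(img_w, int(round(x1 + pad))),
        min(img_h, int(round(y1 + pad))),
    )


def target_size(width, height, max_len):
    """Size that makes the longer side equal max_len, keeping aspect ratio."""
    scale = max_len / max(width, height)
    return (round(width * scale), round(height * scale))


def crop_table(image, bbox, pad=30, max_len=1000):
    """Crop a padded table region from a PIL-style image and resize it.

    Assumption: `image` offers .size, .crop(box) and .resize(size),
    as PIL.Image.Image does.
    """
    img_w, img_h = image.size
    box = expand_bbox(bbox, pad, img_w, img_h)
    crop = image.crop(box)
    return crop.resize(target_size(*crop.size, max_len))
```

Note that when the table sits close to the page edge, clamping leaves less than `pad` pixels on that side; padding the crop with extra white pixels would be one way to make up the difference.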

@NielsRogge Expanding the table bounding box or padding the image is needed for the current TSR model checkpoint but not necessarily other model checkpoints that would be released in the future. Probably this needs to be addressed in the model documentation and colab notebook for using the TSR model--but not added in the pre-processing code for the model. What do you think?

salman-moh commented 2 years ago

FWIW, I've added a +20 white margin around the table and it's now detecting everything perfectly. Along with this, I modified the DetrFeatureExtractor settings for TD and TSR to set the max length to 800 and 1000 respectively.
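For anyone wanting to reproduce the white margin, a minimal sketch could look like this. It assumes "+20" means 20 pixels on every side and that Pillow is installed; the function names `padded_canvas` and `add_white_margin` are made up for illustration.

```python
def padded_canvas(width, height, margin=20):
    """Return (canvas_size, paste_offset) for a uniform margin on all sides."""
    return (width + 2 * margin, height + 2 * margin), (margin, margin)


def add_white_margin(image, margin=20):
    """Paste a PIL image onto a larger white canvas (assumes Pillow)."""
    from PIL import Image  # imported lazily so the size helper works without Pillow

    size, offset = padded_canvas(*image.size, margin)
    canvas = Image.new("RGB", size, "white")
    canvas.paste(image, offset)
    return canvas
```

The padded image would then be resized to the maximum length the TSR model expects (1000, per the comment above) before inference.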

Sample of an out-of-domain table (sparse dataset from IBM's SynthTabNet)

Any clues on how to convert the TSR output into some sort of structured format would be appreciated! Perhaps using your postprocessing.py?

You might think TD is missing the last row, but the entire image is actually just the table, so I guess this can be expected. Still, the model does well on the small sample I created.
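On the question of turning TSR output into a structured format, one possible starting point is to intersect the detected row and column boxes into a grid of cell boxes. The sketch below is not the logic from postprocessing.py; it assumes the rows and columns have already been filtered from the model output as `(x0, y0, x1, y1)` boxes, ignores spanning cells and headers, and uses a hypothetical function name.

```python
def cells_from_rows_and_columns(rows, columns):
    """Build grid[r][c] = cell bbox from row and column boxes.

    rows, columns: lists of (x0, y0, x1, y1) boxes (assumed format).
    Each cell takes its horizontal extent from the column and its
    vertical extent from the row.
    """
    rows = sorted(rows, key=lambda b: b[1])        # top to bottom
    columns = sorted(columns, key=lambda b: b[0])  # left to right
    return [
        [(col[0], row[1], col[2], row[3]) for col in columns]
        for row in rows
    ]
```

Each cell box could then be matched against word boxes from OCR or the PDF text layer to fill in the cell contents.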

light42 commented 2 years ago

I find that the model is less accurate when the text is tightly fitted within the cell borders. And it can't be fixed with padding.

microsoft / table-transformer

Input for TSR model? #69