microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

Annotation instructions (TSR) #111

Closed · Danferno closed 1 year ago

Danferno commented 1 year ago

Would it be possible to share the annotation instructions for the TSR tasks? E.g. should bounding boxes be tight around text or as close to the implicit row as possible? What is the purpose of the "no object" label?

That would help with fine-tuning the model on custom data.

Thanks!

EDIT: Just read the paper more closely; there are no annotators: the data was generated programmatically from XML files.

Danferno commented 1 year ago

For anyone else interested, I plotted the exact annotations per annotation type for a sample of the PubTables-1M dataset, which answered most of my own questions: images_annotated.zip

bely66 commented 1 year ago

@Danferno Did it yield good results in the end? I'm fine-tuning with my own data and the highest AP50 I could get was 83%.

Danferno commented 1 year ago

Not really, to be honest. I've started developing my own model in the hope of getting better results, but there's no guarantee it'll be any better...

giuqoob commented 1 year ago

Wish I had seen this earlier; I just finished my own instructions yesterday, so I might as well publish them. They are open to comments, but you need to log in to Notion to leave them.

https://unpoco.notion.site/How-to-annotate-to-match-PubTab1M-and-FinTabNet-32b6ac519d2a405ebddff10a0f2b8259

I'm not 100% sure, but as I understand it the "no object" label is handled internally by the DETR architecture (it is assigned to unmatched queries), so you can't annotate for it.
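
To illustrate what I mean, here is a rough sketch of the DETR mechanics, not the repo's code; the class count, hidden size and query count are my assumptions for the TSR setup. The classification head has one extra logit, and any query the Hungarian matcher leaves unmatched is simply trained towards that extra "no object" class, so it never shows up in the annotation files.

import torch
import torch.nn as nn

num_classes = 6     # TSR: table, column, row, column header, projected row header, spanning cell
hidden_dim = 256    # DETR default
class_head = nn.Linear(hidden_dim, num_classes + 1)   # last index = "no object"

num_queries = 125   # assumed value from the structure config
# every query defaults to "no object"; matched queries later get real class ids
targets = torch.full((num_queries,), num_classes, dtype=torch.long)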

Another important part for training could be the pre-processing that is done to the images.

This is a snippet of what is done in TSR

R.Compose([
            # crop tightly around the table (label 0), padded by 30 px or 10 px on each side (50:50)
            R.RandomSelect(TightAnnotationCrop([0], 30, 30, 30, 30),
                           TightAnnotationCrop([0], 10, 10, 10, 10),
                           p=0.5),
            # randomly resize the image, erase random patches, and normalize
            RandomMaxResize(900, 1100), random_erasing, normalize
        ])

As I understand it, for TSR the crop around annotations with label 0 (tables) is padded by either 30 or 10 pixels on all sides, at a 50:50 chance. Additionally, the image is randomly resized, has random patches erased, and is normalized.

So one would have to run their own images through a similar gauntlet, and I'm just not sure whether this happens when following the Hugging Face instructions.
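
If you want to replicate that gauntlet outside the training script, a rough torchvision sketch for a single image might look like the following. This is my own approximation, not the repo's code: TightAnnotationCrop and RandomMaxResize are the repo's classes, and the ImageNet normalization constants are the DETR defaults, so treat the details as assumptions.

import random
from PIL import Image
import torchvision.transforms.functional as F

def preprocess_table_crop(image: Image.Image, table_bbox, pad_choices=(30, 10), max_size_range=(900, 1100)):
    # pad the table bbox by 30 or 10 px on every side (50:50), then crop
    pad = random.choice(pad_choices)
    x1, y1, x2, y2 = table_bbox
    crop = image.crop((max(0, x1 - pad), max(0, y1 - pad),
                       min(image.width, x2 + pad), min(image.height, y2 + pad)))
    # resize so the longer side lands somewhere in [900, 1100], keeping aspect ratio
    target = random.randint(*max_size_range)
    scale = target / max(crop.size)
    crop = crop.resize((round(crop.width * scale), round(crop.height * scale)))
    # to tensor + ImageNet normalization (the DETR default); note that any
    # annotation boxes would need the same shift and scale applied to them
    tensor = F.to_tensor(crop)
    return F.normalize(tensor, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])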

@Danferno did you fine-tune using Hugging Face or by using the code in this repo?

bely66 commented 1 year ago

That was very informative, thank you for providing the links. I think everyone on this repo is training using its configuration. Did you find the preprocessing mentioned in the paper available here in the repo?

giuqoob commented 1 year ago

Everything I quote here is from the code; that particular example is from detr > datasets > transforms.py, although if you run the training script all your images will be pre-processed anyway.

When it comes to processing images before they are sent to the training script, I did that by downloading the PubTables-1M and FinTabNet datasets and pre-processing them with the scripts provided. The images in the Notion doc show what they look like when they go to the training script.

I have added a section about what the images look like before they are sent to training. You'll find it under the chapter "Images in training dataset" in the Notion doc linked above.

Here we can see that for FinTabNet there is a padding of 30 pixels around the table bbox: https://github.com/microsoft/table-transformer/blob/02a2a4eded9ccf70be356fad405bc9555b4ad551/scripts/process_fintabnet.py#L1054

And here you can see how that padding is applied

https://github.com/microsoft/table-transformer/blob/02a2a4eded9ccf70be356fad405bc9555b4ad551/scripts/process_fintabnet.py#L1418C1-L1434
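
For what it's worth, here is a hedged sketch of what that padding/cropping step amounts to conceptually (not the repo's actual code, and the function name is made up): expand the table bbox by 30 px, clamp it to the page, crop, and shift every annotation box into the crop's coordinate frame.

from PIL import Image

def crop_table_with_padding(page_img: Image.Image, table_bbox, other_bboxes, pad=30):
    x1, y1, x2, y2 = table_bbox
    crop_box = (max(0, x1 - pad), max(0, y1 - pad),
                min(page_img.width, x2 + pad), min(page_img.height, y2 + pad))
    crop = page_img.crop(crop_box)
    # translate the remaining annotation boxes (rows, columns, cells) so they
    # are relative to the crop's top-left corner
    dx, dy = crop_box[0], crop_box[1]
    shifted = [(bx1 - dx, by1 - dy, bx2 - dx, by2 - dy) for bx1, by1, bx2, by2 in other_bboxes]
    return crop, shifted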

Danferno commented 1 year ago

I used the training script in the repo (see the code cutout below). I only used it for table structure recognition, not detection. For detection I trained a YOLOv7 on 6K annotated pages, which worked fairly well (it cost about $200 on Fiverr to get the pages annotated).


import subprocess
import sys
from datetime import datetime

# PATH_DATA, PATH_ROOT, PATH_CONFIG, PATH_WEIGHTS, FINETUNE and epochs are defined elsewhere in my script
pathIn = PATH_DATA / 'structure500_jpg'
pathPython = sys.executable
pathScript = 'main.py'
pathWD = PATH_ROOT / 'table-transformer' / 'src'
pathModel = PATH_ROOT / 'models' / datetime.now().strftime('%Y_%m_%d-%H_%M')
loadWeightOnly = '--load_weights_only' if FINETUNE else ''

command = f'{pathPython} {pathScript} --data_type structure --mode train --config_file {PATH_CONFIG} --data_root_dir {pathIn}' \
            f' --model_load_path {PATH_WEIGHTS} {loadWeightOnly} --batch_size 4 --model_save_dir {pathModel}' \
            f' --epochs {epochs}'
# passing the command as a single string works on Windows; on Linux/macOS use shlex.split(command) or shell=True
subprocess.run(command, cwd=pathWD, check=True, stdout=sys.stdout)
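
(As far as I can tell, --load_weights_only makes main.py load just the model weights from the checkpoint, rather than also restoring the optimizer state for resuming a run, which is what you want when fine-tuning from the released checkpoint.)
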
huschi commented 1 year ago

@Danferno thank you for sharing some insights. Did you compare Table Transformer table detection to YOLOv7? Is there a big difference?

huschi commented 1 year ago

For Table Transformer table detection, do I need to add a padding of 30 pixels around my page, or is that only required for table structure recognition?

Danferno commented 1 year ago

> @Danferno thank you for sharing some insights. Did you compare Table Transformer table detection to YOLOv7? Is there a big difference?

There was for me. Accuracy (also in terms of bounding-box overlap) with the custom-trained YOLOv7 was super high (like, 99%) and processing speed was fast. Performance in both accuracy and speed was terrible with Table Transformer detection.

> For Table Transformer table detection, do I need to add a padding of 30 pixels around my page, or is that only required for table structure recognition?

I think it's only needed for structure recognition.
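
In case it helps, here is a minimal inference sketch of that idea: take a table bbox from your own detector, pad it by roughly 30 px before cropping, and hand the crop to the structure model. It assumes you run structure recognition through the Hugging Face checkpoint, so treat the exact calls and the padding value as assumptions rather than the official recipe.

from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-structure-recognition")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")

def recognize_structure(page: Image.Image, table_bbox, pad=30):
    # pad the detected table bbox before cropping, mirroring the padding the
    # training crops were produced with
    x1, y1, x2, y2 = table_bbox
    crop = page.crop((max(0, x1 - pad), max(0, y1 - pad),
                      min(page.width, x2 + pad), min(page.height, y2 + pad)))
    inputs = processor(images=crop, return_tensors="pt")
    outputs = model(**inputs)
    # boxes come back in the crop's pixel coordinates
    return processor.post_process_object_detection(
        outputs, threshold=0.6, target_sizes=[crop.size[::-1]])[0]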