microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License

How to load fine-tuned model into Hugging-Face Table-Transformer #127

Open Prabhav55 opened 1 year ago

Prabhav55 commented 1 year ago

Hi,

Recently I have tried fine-tuning the table transformer model with a small dataset. However, I was wondering if there is a way to load the model into Hugging Face's TableTransformerForObjectDetection.

When I try to do that with a path to the .pth file, it asks for a config. If I pass the config (structure_recognition.json), it drops a lot of the weights because the model structures do not match.

Any help regarding this would be really nice!

Thanks, Prabhav

thiagodma commented 1 year ago

Hey @Prabhav55! You should probably use the Hugging Face model for fine-tuning; there are some differences between the plain PyTorch model in this repo and the HF model.

As TATR is just a DETR, you can use the notebook for fine-tuning a DETR as a reference.

Ashwani-Dangwal commented 12 months ago

Hey @thiagodma, I also noticed some differences between the Hugging Face model and the original model. Did you manage to understand why that is? Also, were you able to fine-tune the Hugging Face model? If yes, can you point out the changes needed in the notebook mentioned here?

The only difference is that the Table Transformer applies a "normalize before" operation, which means that layernorms are applied before, rather than after, the MLPs/attention.

Thanks
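The "normalize before" ordering mentioned above can be sketched schematically (this is an illustration of the ordering only, not the actual HF implementation; `sublayer` stands in for self-attention or the MLP, and the function names are mine):

```python
# Post-norm (vanilla DETR ordering): LayerNorm is applied after the residual add.
def post_norm_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-norm (Table Transformer's "normalize before"): LayerNorm is applied to the
# sublayer's input, and the residual connection bypasses the norm entirely.
def pre_norm_block(x, sublayer, norm):
    return x + sublayer(norm(x))
```

With identical weights, the two orderings produce different activations, which is one reason a checkpoint from one variant cannot simply be loaded into the other without mismatches.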

giuqoob commented 12 months ago

A comment here: I haven't dug deeper into the HF model's innards, but I've gone through the full TATR repo, and there are significant image-processing steps there that I'm not sure are implemented in the HF model. This relates to the question of whether, e.g. in structure recognition, images should be tightly cropped around the bbox (more recent paper) or have some padding around them (older paper). The current weights are based on the older paper, so with more padding, while the code in the repo (at least the image pre-processing scripts) assumes tighter cropping. The result is completely different training data, which may or may not impact results; but from experience I can tell that training this model from scratch takes a week, so you might want to take that into consideration before going with the HF model.

I'm not a CV expert (rather, a beginner), but I'd assume fine-tuning has the same issue, especially for structure recognition. If the model has a million padded images as training data and you fine-tune it with tightly cropped images, the model may not do what you want. I think I observed something like this a month ago: the structure recognition model always assumed there was padding around the table during inference, yielding bboxes that were slightly too small around the edges.

So I recommend at least going through what happens to images before they are passed to training, starting from main.py and the linked files. But as said, the weights do not match the code at the moment: the code is for the recent paper (weights not released yet) and the weights are for the old paper (different transformations to the data).
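To make the two cropping conventions concrete, here is a minimal sketch of the crop computation (the function and the padding value are mine for illustration, not taken from the repo's preprocessing scripts):

```python
def expand_bbox(bbox, padding):
    """Expand a (xmin, ymin, xmax, ymax) table bbox by `padding` pixels per side."""
    xmin, ymin, xmax, ymax = bbox
    return (xmin - padding, ymin - padding, xmax + padding, ymax + padding)

table_bbox = (100, 200, 500, 400)           # hypothetical table location on a page
padded_crop = expand_bbox(table_bbox, 30)   # older-paper style: context around the table
tight_crop = expand_bbox(table_bbox, 0)     # newer-code style: tight crop on the bbox
```

A model trained exclusively on one convention sees a systematically shifted input distribution when given the other, which matches the edge-shrunken bboxes described above.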

thiagodma commented 12 months ago

@Ashwani-Dangwal yeah, I managed to fine-tune it. Basically all I did was:

Replace `processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")` with `processor = DetrImageProcessor()`

Replace:

```python
DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    revision="no_timm",
    num_labels=len(id2label),
    ignore_mismatched_sizes=True,
)
```

with:

```python
TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition",
    ignore_mismatched_sizes=True,
)
```

WalidHadri-Iron commented 12 months ago

@thiagodma For the dataset preparation, I wonder which dataset you fine-tuned on. If it was FinTabNet, did you use the code here to canonicalize the cells and get FinTabNet.c? And if so, how did you prepare the dataset for use with the HF training?

@giuqoob I agree with you that training or fine-tuning takes days. I did some fine-tuning on FinTabNet using the code in this repo and saw good improvements in the scores. Yet I think there is some problem with row detection. Did you notice the same issue? I wonder if it's due to bad training or just a limitation of DETR.

giuqoob commented 12 months ago

@WalidHadri-Iron I created the FinTabNet(.a6) and PubTables datasets with the scripts provided and trained the model from scratch without a limit on batches per epoch, running it for 22 epochs. I haven't tested on my own data yet, but using eval mode I got the results below, which are worse than what the authors report. I used the following params and didn't touch any other settings, like the learning rate.

What results did you get?

For model trained on 20 epochs
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.785
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.949
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.874
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.513
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.700
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.814
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.407
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.776
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.852
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.602
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.777
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.887
----------------------------------------------------------------------------------------------------
Results on simple tables (47784 total):
      Accuracy_Con: 0.8406
         GriTS_Top: 0.9838
         GriTS_Con: 0.9805
         GriTS_Loc: 0.9715
--------------------------------------------------
Results on complex tables (55339 total):
      Accuracy_Con: 0.5572
         GriTS_Top: 0.9598
         GriTS_Con: 0.9586
         GriTS_Loc: 0.9382
--------------------------------------------------
Results on all tables (103123 total):
      Accuracy_Con: 0.6885
         GriTS_Top: 0.9709
         GriTS_Con: 0.9688
         GriTS_Loc: 0.9536
--------------------------------------------------
Total time taken for 103123 samples: 13:09:16.739213
COCO metrics summary: AP50: 0.949, AP75: 0.874, AP: 0.785, AR: 0.852
thiagodma commented 12 months ago

> @thiagodma For the dataset preparation, I wonder which dataset you fine-tuned on. If it was FinTabNet, did you use the code here to canonicalize the cells and get FinTabNet.c? And if so, how did you prepare the dataset for use with the HF training?

@WalidHadri-Iron I fine-tuned it using a proprietary dataset

Ashwani-Dangwal commented 12 months ago

@thiagodma Thanks for the help. By the way, must the training data for the Hugging Face model be in Pascal VOC format, or is it something else?

thiagodma commented 12 months ago

@Ashwani-Dangwal I'm using COCO format, but I think this is easy to change. I guess all you have to do is change the PyTorch Dataset definition.
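For reference on the format question above: the two annotation styles differ mainly in box encoding. Pascal VOC stores (xmin, ymin, xmax, ymax) while COCO stores (x, y, width, height). A minimal converter (hypothetical helpers, not part of either repo):

```python
def voc_to_coco(box):
    """Convert a Pascal VOC (xmin, ymin, xmax, ymax) box to COCO (x, y, w, h)."""
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]

def coco_to_voc(box):
    """Convert a COCO (x, y, w, h) box back to Pascal VOC (xmin, ymin, xmax, ymax)."""
    x, y, w, h = box
    return [x, y, x + w, y + h]
```

Converting the boxes is the easy part; the Dataset class also has to emit labels in whatever dict structure the chosen training loop expects.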

Ashwani-Dangwal commented 12 months ago

@WalidHadri-Iron I got the same type of error after fine-tuning on the FinTabNet dataset, where complete rows were not detected properly. Did you find a workaround for this?

Ashwani-Dangwal commented 12 months ago

@thiagodma Thanks Man!

Prabhav55 commented 12 months ago

@giuqoob Did you fine-tune on the complete FinTabNet dataset using the Hugging Face model as the base? Unfortunately, I fine-tuned on the same data using the main.py script rather than the Hugging Face model.

If you did, would it be OK to share the FinTabNet weights? I would really appreciate that.

Thanks!

WalidHadri-Iron commented 12 months ago

@giuqoob I just fine-tuned on FinTabNet. I kept the TSR configuration the same, except for the learning rates, which I changed from "lr": 5e-5, "lr_backbone": 1e-5 to "lr": 1e-5, "lr_backbone": 1e-6. I fine-tuned for 15 epochs; from the first epoch there is an important jump in the metrics, and from epoch 10 to epoch 15 there is no big gain. I haven't had time to run the eval script, but here are the metrics for the best epoch, computed on a sample of the test set during training.

IoU metric: bbox
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.866
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.971
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.924
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.587
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.848
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.869
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.502
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.870
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.914
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.616
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.900
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.922

AP50: 0.971, AP75: 0.924, AP: 0.866, AR: 0.914

I will update with the full output of the eval code when I have time to run it. I am also thinking about changing the whole training setup to do better; this was just my first trial.

@Ashwani-Dangwal: As a workaround, I put a low threshold for rows and enhanced the post-processing based on the text position and some characteristics.
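That workaround might look something like this sketch (the threshold value and the text-snapping heuristic are my guesses at the approach described, not @WalidHadri-Iron's actual code):

```python
def keep_rows(detections, score_threshold=0.3):
    """detections: list of (score, (xmin, ymin, xmax, ymax)) row predictions.
    A deliberately low threshold keeps rows the model is unsure about; text
    positions can then confirm or discard them in post-processing."""
    return [box for score, box in detections if score >= score_threshold]

def snap_rows_to_text(rows, text_ymin, text_ymax):
    """Clamp row boxes to the vertical extent actually covered by text tokens."""
    return [(xmin, max(ymin, text_ymin), xmax, min(ymax, text_ymax))
            for xmin, ymin, xmax, ymax in rows]
```

The idea is to trade precision at the detector for recall, then recover precision with rules based on where the text actually sits.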

WalidHadri-Iron commented 12 months ago

> Hi,
>
> Recently I have tried fine-tuning the table transformer model with a small dataset. However, I was wondering if there is a way to load the model into Hugging Face's TableTransformerForObjectDetection.
>
> When I try to do that with a path to the .pth file, it asks for a config. If I pass the config (structure_recognition.json), it drops a lot of the weights because the model structures do not match.
>
> Any help regarding this would be really nice!
>
> Thanks, Prabhav

To answer the initial question, check my comment here https://github.com/NielsRogge/Transformers-Tutorials/issues/316#issuecomment-1624001118

bely66 commented 11 months ago

@WalidHadri-Iron @giuqoob The FinTabNet dataset available here can be processed to match the new paper using code from this repo; would it be hard to point to these folders?

Also, I wanted to understand which old paper you are referring to that the model is trained on.

Ashwani-Dangwal commented 11 months ago

Has anyone managed to train the hugging face model in the fintabnet dataset?

WalidHadri-Iron commented 11 months ago

> @WalidHadri-Iron @giuqoob The FinTabNet dataset available here can be processed to match the new paper using code from this repo; would it be hard to point to these folders?
>
> Also, I wanted to understand which old paper you are referring to that the model is trained on.

@bely66 The scripts are in here https://github.com/microsoft/table-transformer/tree/main/scripts.

As @giuqoob pointed out before, if anyone is interested in using the code here to train, infer, or process some data, I would also recommend spending at least some time exploring the files in the repo.

Ashwani-Dangwal commented 11 months ago

@WalidHadri-Iron, have you tried training the hugging face model on the FinTabNet dataset?

Prabhav55 commented 11 months ago

@Ashwani-Dangwal I have tried training it on FinTabNet, and it worked for me after converting the model weights using the script convert_table_transformer_original_pytorch_checkpoint_to_pytorch.py, present in the transformers repo.

Ashwani-Dangwal commented 11 months ago

@Prabhav55 I also trained the original model on FinTabNet, but it was not performing well. So I wanted to fine-tune the Hugging Face model on FinTabNet.

WalidHadri-Iron commented 11 months ago

@Ashwani-Dangwal I did exactly like @Prabhav55

bsmock commented 11 months ago

If you're training on the original PubTables-1M and FinTabNet.c (FinTabNet.a6) together, then one reason you may see lower numbers during evaluation is that we changed how we evaluate on PubTables-1M in our most recent paper. To match those numbers, you have to run https://github.com/microsoft/table-transformer/blob/42867c86768388ca4cafd546178abfb15c63aed3/scripts/create_padded_dataset.py on the validation and test splits to more tightly crop those images. The training data stays the same, so do not run the script on the training split. This step is not yet documented; sorry for that.

And just to add extra clarification, if you trained on PubTables-1M and FinTabNet.c using the current code without doing this step, there is no need to redo the training. You only need to do this cropping step for evaluation to match our reported numbers. We will update documentation for this soon.

Best, Brandon

bely66 commented 11 months ago

> > @WalidHadri-Iron @giuqoob The FinTabNet dataset available here can be processed to match the new paper using code from this repo; would it be hard to point to these folders? Also, I wanted to understand which old paper you are referring to that the model is trained on.
>
> @bely66 The scripts are in here https://github.com/microsoft/table-transformer/tree/main/scripts.
>
> As @giuqoob pointed out before, if anyone is interested in using the code here to train, infer, or process some data, I would also recommend spending at least some time exploring the files in the repo.

Yep, got your point, and went through the repo. So basically the model is trained on old annotations and the code expects new annotations.

So if I fine-tune the model with the old annotations, will I still get bad results because of the code?

@bsmock Would it be hard to confirm that, if I'm fine-tuning on my data with the old annotations, there would be problems from the code?

linkstatic12 commented 10 months ago

Can you share the FinTabNet model here?

NielsRogge commented 7 months ago

Hi,

See #158

ali4friends71 commented 2 months ago

Hi @Prabhav55. Were you able to load the model after training it? It is asking for a config, and when I give it the config, some params are dropped and I'm unable to load it. If you managed it, can you please let me know how you did it? The Python file convert_table_transformer_original_pytorch_checkpoint_to_pytorch.py is also no longer available at the mentioned link; if you have it, can you please share it with me? Thanks in advance @bsmock @NielsRogge