ibm-aur-nlp / PubTabNet


Structure decoder with beam size 3 not working #12

Open Sharathmk99 opened 3 years ago

Sharathmk99 commented 3 years ago

Hi @zhxgj

I went through your paper; it's an amazing paper. Thank you :)

Initially I thought of training only the structure task. I started from the excellent tutorial https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning, downloaded the PubTabNet 2.0 dataset, and preprocessed it with the filters below:

html_max_token_length = 300
image_shape = (512, 512)
image_resize = (448, 448)

I took only 100k examples from train and a validation set of 5k. I got a wordmap of size 32 (including start, end, pad and unk):

{"<thead>": 1, "<tr>": 2, "<td>": 3, "</td>": 4, "</tr>": 5, "</thead>": 6, "<tbody>": 7, "</tbody>": 8, "<td": 9, " colspan=\"5\"": 10, ">": 11, " colspan=\"2\"": 12, " colspan=\"3\"": 13, " rowspan=\"2\"": 14, " colspan=\"4\"": 15, " colspan=\"6\"": 16, " rowspan=\"3\"": 17, " colspan=\"9\"": 18, " colspan=\"10\"": 19, " colspan=\"7\"": 20, " rowspan=\"4\"": 21, " rowspan=\"5\"": 22, " rowspan=\"9\"": 23, " colspan=\"8\"": 24, " rowspan=\"8\"": 25, " rowspan=\"6\"": 26, " rowspan=\"7\"": 27, " rowspan=\"10\"": 28, "<unk>": 29, "<start>": 30, "<end>": 31, "<pad>": 0}
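For reference, a minimal sketch of how such a wordmap can be built from tokenized structure sequences. The `build_wordmap` helper and its `min_freq` parameter are illustrative assumptions, not the tutorial's actual code:

```python
# Illustrative sketch: build a token-to-index wordmap like the one above.
# Index 0 is reserved for <pad>; special tokens go after the vocabulary.
from collections import Counter

def build_wordmap(sequences, min_freq=1):
    counts = Counter(tok for seq in sequences for tok in seq)
    tokens = [t for t, c in counts.items() if c >= min_freq]
    wordmap = {t: i + 1 for i, t in enumerate(tokens)}  # 0 reserved for <pad>
    for special in ("<unk>", "<start>", "<end>"):
        wordmap[special] = len(wordmap) + 1
    wordmap["<pad>"] = 0
    return wordmap

wm = build_wordmap([["<thead>", "<tr>", "<td>", "</td>", "</tr>", "</thead>"]])
```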

Finally, I started training with the configuration below:

emb_dim = 16
attention_dim = 256
decoder_dim = 256
dropout = 0.5
batch_size = 8
encoder_lr = 1e-4
decoder_lr = 4e-4
encoded_image_size=14
encoder_dim=2048 # In decoder class
encoder_fine_tune = True # Enabled fine tuning of resnet101 blocks 2 through 4

On my first epoch, accuracy went up to 95%, which cannot be correct, so I must be doing something wrong. Example snapshot:

Epoch: [0][0/11997] Batch Time 3.392 (3.392)    Data Load Time 0.491 (0.491)    Loss 3.5033 (3.5033)    Top-5 Accuracy 20.914 (20.914)
Epoch: [0][100/11997]   Batch Time 2.418 (2.493)    Data Load Time 0.000 (0.005)    Loss 1.3730 (1.8123)    Top-5 Accuracy 96.483 (92.530)
Epoch: [0][200/11997]   Batch Time 2.541 (2.484)    Data Load Time 0.000 (0.003)    Loss 0.8896 (1.4701)    Top-5 Accuracy 98.584 (94.610)
Epoch: [0][300/11997]   Batch Time 2.524 (2.482)    Data Load Time 0.000 (0.002)    Loss 0.6605 (1.2468)    Top-5 Accuracy 99.691 (95.879)
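One possible reading of these numbers (an editorial note, not from the original thread): with only 32 structure tokens and sequences dominated by `<td>`/`</td>`, per-token top-5 accuracy can look very high even for a weak model. A rough count on a hypothetical 20-cell sequence illustrates how much of the ground truth a handful of frequent tokens covers:

```python
# Hypothetical table-structure sequence: two header tags, 20 cell pairs,
# two closing tags. Measure what fraction is covered by the few most
# frequent tokens alone.
seq = ["<thead>", "<tr>"] + ["<td>", "</td>"] * 20 + ["</tr>", "</thead>"]
common = {"<td>", "</td>", "<tr>", "</tr>", ">"}
frac_common = sum(t in common for t in seq) / len(seq)  # > 0.95
```

So a model that merely learns the marginal token distribution can already score in the mid-90s on top-5 token accuracy; that metric says little about whether whole-table structure is right.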

After 3 epochs, I took the best model and ran inference with beam size 3 (screenshot attached).

If I increase the beam size to 10, I get the output below (screenshot attached).

But the output above doesn't change if I pass a different image.

Can you help point out what I'm doing wrong here? I could have trained for more epochs, but the accuracy and loss did not look right to me.

Please help.

zhxgj commented 3 years ago

Hmm, I did not see anything obviously wrong, so I can only make some guesses. First, I did not use a pre-trained ResNet, because I think a model pre-trained on ImageNet would not help much with tables; I trained the whole ResNet from scratch. Second, when you pre-processed the images into h5 files, what resolution did you use? Another thing worth trying is to pass training images into your model and see what you get. If it does not work even on training samples (where your training log shows high accuracy), I think there may be something wrong in your inference code, or maybe your training loop does not iterate through your data properly.

Sharathmk99 commented 3 years ago

Hi @zhxgj Thank you for your quick response. I have pre-processed the images to (448, 448).

Let me fine-tune all the layers of the ResNet and give it a try. Did you also fine-tune the decoder embedding layer?

I'll also try passing a training image to the inference code and check. Will update here.

Sharathmk99 commented 3 years ago

Hi @zhxgj I tried using resnet101 without pretrained weights and fine-tuning all of its layers, as below:

modules = list(resnet.children())[:-2]  # drop the final avgpool and fc layers
self.resnet = nn.Sequential(*modules)

Fine-tuning:

def fine_tune(self, fine_tune=True):
    # Freeze everything first, then unfreeze according to the flag.
    # Note: the attribute is `requires_grad`; the original post had a
    # `required_grad` typo, which silently sets a new, unused attribute.
    for p in self.resnet.parameters():
        p.requires_grad = False
    for c in list(self.resnet.children()):
        for p in c.parameters():
            p.requires_grad = fine_tune

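A quick PyTorch-free illustration of why the spelling matters here: assigning to a misspelled `required_grad` attribute silently creates a brand-new attribute and leaves the real `requires_grad` flag untouched, so nothing is actually frozen or unfrozen. The `FakeParam` class below is a stand-in for `torch.nn.Parameter`:

```python
# Stand-in for torch.nn.Parameter: only the flag's behavior is modeled.
class FakeParam:
    def __init__(self):
        self.requires_grad = True  # the real autograd flag

p = FakeParam()
p.required_grad = False            # typo: creates a NEW attribute
typo_leaves_flag_set = p.requires_grad  # real flag is still True
p.requires_grad = False            # correct spelling actually freezes it
```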
After the first epoch I see the following:

Epoch: [0][15000/15996] Batch Time 2.327 (2.084)    Data Load Time 0.000 (0.000)    Loss 0.2146 (0.3881)    Top-3 Accuracy 99.900 (99.432)
Epoch: [0][15500/15996] Batch Time 1.728 (2.084)    Data Load Time 0.000 (0.000)    Loss 0.3417 (0.3861)    Top-3 Accuracy 99.675 (99.444)
Validation: [0/804] Batch Time 0.897 (0.897)    Loss 0.5035 (0.5035)    Top-5 Accuracy 99.708 (99.708)  
Validation: [500/804]   Batch Time 0.770 (0.706)    Loss 0.2969 (0.3245)    Top-5 Accuracy 99.903 (99.906)  

 * LOSS - 0.324, TOP-5 ACCURACY - 99.911, BLEU-4 - 0.9331066764637999

Second epoch:

Epoch: [1][15000/15996] Batch Time 2.322 (2.079)    Data Load Time 0.000 (0.000)    Loss 0.2136 (0.3228)    Top-3 Accuracy 99.900 (99.816)
Epoch: [1][15500/15996] Batch Time 1.737 (2.079)    Data Load Time 0.000 (0.000)    Loss 0.3305 (0.3226)    Top-3 Accuracy 99.838 (99.817)
Validation: [0/804] Batch Time 0.871 (0.871)    Loss 0.4967 (0.4967)    Top-5 Accuracy 99.854 (99.854)  
Validation: [500/804]   Batch Time 0.771 (0.711)    Loss 0.2525 (0.3136)    Top-5 Accuracy 99.903 (99.933)  

 * LOSS - 0.314, TOP-5 ACCURACY - 99.934, BLEU-4 - 0.9406472943100782

I still feel I'm doing something wrong.

As suggested, I tried using a training image for inference, but I am not getting the correct output. Inference output: ['<start>', '<thead>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '</thead>', '<tbody>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '</tbody>', '<end>']

Actual output: ['<thead>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '</thead>', '<tbody>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>',
'</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>', '</tbody>']

I have not changed anything in caption.py from https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning/blob/master/caption.py

And regarding training, I have not changed any major code. I tried flickr8k with my code base and it works as expected, but I am not getting correct output for table structure. Requesting your help, please.

Sharathmk99 commented 3 years ago

I'm also trying to run the training with top-1 word accuracy, which I think is effectively the greedy method.

One more question: did you use any transform for normalization? I'm using the following: normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
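For context, `transforms.Normalize` applies `(x - mean) / std` per channel. Table images are mostly black text on a white background, so ImageNet statistics may be a poor match. A minimal dependency-free check of what the transform does to a white background pixel (the `normalize_pixel` helper is illustrative, not torchvision code):

```python
# ImageNet channel statistics, as used in the transforms.Normalize call above.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Per-channel (x - mean) / std, as transforms.Normalize does."""
    return [(c - m) / s for c, m, s in zip(rgb, mean, std)]

white = normalize_pixel([1.0, 1.0, 1.0])  # dominant background value in tables
```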

Sharathmk99 commented 3 years ago

@zhxgj I tried running inference while training the model itself. The output from the model is correct in that case, where teacher forcing is enabled. But if I use the trained model without teacher forcing, the output is bad.
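A toy, PyTorch-free sketch of the difference being described: with teacher forcing the decoder is always fed ground-truth tokens, whereas at inference it must feed its own previous prediction back in, so early mistakes compound. The `decode` helper and the one-step `table` model are purely hypothetical:

```python
# Greedy free-running decode: each prediction becomes the next input.
def decode(step_fn, start_tok, end_tok, max_len=50):
    tokens = [start_tok]
    for _ in range(max_len):
        nxt = step_fn(tokens[-1])  # model consumes its OWN last output
        tokens.append(nxt)
        if nxt == end_tok:
            break
    return tokens

# Hypothetical one-step "model" mapping a token to the next tag.
table = {"<start>": "<tr>", "<tr>": "<td>", "<td>": "</td>", "</td>": "<end>"}
seq = decode(lambda t: table[t], "<start>", "<end>")
```

Under teacher forcing this feedback loop never runs, which is why per-token training accuracy can be high while free-running inference collapses.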

PS: After removing the normalize function, my model improved a little. It is still not working well, though.

Looking forward to your response.

yongshuaihuang commented 3 years ago

Hi @Sharathmk99 , have you solved the problem yet?