marcodiri / s2s-contrastive-text-recognition

PyTorch implementation of the SeqCLR paper. (Exam assignment for the Machine Learning course at the University of Florence)

Question about training sequence #7

Closed. kylechang523 closed this issue 1 year ago

kylechang523 commented 1 year ago

Hi, thanks for sharing the code. May I ask about the training pipeline? Are there any instructions for training, and any results? Thank you!

marcodiri commented 1 year ago

Hi, thanks for the interest. First of all, a necessary premise: this is just something I made for a university assignment (my first one in machine learning, actually), so I cannot guarantee its correctness, and it is not in any way meant for serious applications.

Having said that, the kaggle_notebook folder contains some example scripts I used for training.

As for the results, I tested on the IAM dataset and obtained worse results than those in the paper (about 70% accuracy), maybe because of the simpler encoder I used (VGG instead of ResNet). For the same settings as in the paper you can change the scripts' input from --Transformation None --FeatureExtraction VGG to --Transformation TPS --FeatureExtraction ResNet, but I have not tested that.

yusirhhh commented 1 year ago

[image]

Hello, thanks for the shared code. I find that the gap between the pre-trained model and the non-pre-trained model shrinks as the training epochs increase.

Do you have the same result? Can you give me some advice?

marcodiri commented 1 year ago

Hi, thank you for trying it out.

Do you have the same result? Can you give me some advice?

Is that plot the validation accuracy? If so, no: my plot of the validation accuracy (character accuracy) is below, pre-trained (orange) vs. supervised baseline (blue).

[image: validation accuracy]

Did you train the model like the examples in the kaggle_notebook folder?

yusirhhh commented 1 year ago

Thanks for your reply. You are right: I got results similar to yours with your settings when using Attn as the decoder.

My results were obtained using CTC as the decoder and without the BiLSTM. During pre-training I set the temperature tau to 0.2. There may be something wrong with my implementation.
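For reference, tau enters my contrastive loss roughly like the standard NT-Xent formulation (a simplified sketch, not my exact code; z1 and z2 are the projected embeddings of the two augmented views):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.2):
    # z1, z2: (N, D) projected embeddings of the two augmented views
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / tau                               # cosine similarities scaled by tau
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    n = z1.size(0)
    # the positive for row i is the same instance in the other view
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```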

Could you share the code of the CTC-based recognizer?

marcodiri commented 1 year ago

Yeah, the BiLSTM sequence modeling after the feature extractor is important; otherwise you are treating an image of a word like an image of a kitten :) With the CTC decoder the paper also uses a BiLSTM projection head on top of the BiLSTM sequence modeling, while they say it is redundant with Attention because the attention decoder has LSTM layers itself.

For the CTC decoder implementation I suggest you look at Clova AI's project, from which I took the encoder-decoder architecture (linked in the readme), in particular their model and their training loop (just look for the references to CTC).
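The gist of the CTC part is something like this (a simplified sketch with placeholder sizes, not their exact code):

```python
import torch
import torch.nn as nn

num_classes = 37  # placeholder: charset size + 1 for the CTC blank
ctc_head = nn.Linear(512, num_classes)  # 512 = placeholder encoder output size
criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(encoder_frames, labels, label_lengths):
    # encoder_frames: (batch, T, 512) frame sequence from the CNN + BiLSTM encoder
    logits = ctc_head(encoder_frames)                   # (batch, T, num_classes)
    log_probs = logits.log_softmax(2).permute(1, 0, 2)  # CTCLoss expects (T, batch, C)
    input_lengths = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    return criterion(log_probs, labels, input_lengths, label_lengths)
```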

yusirhhh commented 1 year ago

[image]

With the CTC decoder the paper also uses a BiLSTM projection head on top of the BiLSTM sequence modeling, while they say it is redundant with Attention because the attention decoder has LSTM layers itself.

I overlooked this earlier. Thank you very much!

In the pre-training phase, the encoder (CNN plus BiLSTM) is followed by a projection layer, either an MLP or a BiLSTM.

In the downstream task, the pre-trained encoder and a CTC decoder (a single fully connected layer) are used to train the network.

So for the CTC-based recognition network, the projection layer is required during pre-training.
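In other words, something like this (my own summary sketch, with placeholder names and sizes):

```python
import torch.nn as nn

frame_dim = 512  # placeholder encoder output size

# Stand-in for the real encoder: CNN feature extractor + BiLSTM sequence modeling.
encoder = nn.LSTM(frame_dim, frame_dim // 2, bidirectional=True, batch_first=True)

# Pre-training: encoder + projection head, trained with the contrastive loss.
# For a CTC downstream decoder the projection is a BiLSTM; with an attention
# decoder an MLP (or no projection) is enough.
projection = nn.LSTM(frame_dim, frame_dim // 2, bidirectional=True, batch_first=True)

# Downstream: the pre-trained encoder + the CTC decoder, a single fully
# connected layer per frame; the projection head is dropped after pre-training.
ctc_decoder = nn.Linear(frame_dim, 37)  # 37 = placeholder charset + blank
```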

I found that Attn performed much better than the CTC decoder in the decoder-evaluation experiments, but there was little difference in performance between the two in the fine-tuning experiments. Is this caused by the attention decoder having multiple layers that can be optimized when the backbone network is frozen, while the CTC decoder only has a single fully connected layer to learn?

yusirhhh commented 1 year ago

[image]

In addition, I used VGG-BiLSTM as the encoder and Attn as the decoder and froze the backbone network. I found that the performance was quite different from that of the original paper.

marcodiri commented 1 year ago

Is this caused by the attention decoder having multiple layers that can be optimized when the backbone network is frozen, while the CTC decoder only has a single fully connected layer to learn?

Yeah, I think it's because Attn has an LSTM layer itself to be optimized, while CTC relies on the LSTM layers in the encoder, so when the encoder is frozen Attn performs better.
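Schematically (placeholder modules and sizes, just to illustrate where the trainable parameters end up):

```python
import torch.nn as nn

num_classes = 37  # placeholder charset size + blank
frame_dim = 512   # placeholder encoder output size

# Stand-in for the pre-trained CNN + BiLSTM encoder, frozen for evaluation.
encoder = nn.LSTM(frame_dim, frame_dim // 2, bidirectional=True, batch_first=True)
for p in encoder.parameters():
    p.requires_grad = False

# CTC: a single linear layer per frame -- very little capacity left to
# optimize once the encoder is frozen.
ctc_decoder = nn.Linear(frame_dim, num_classes)

# Attention: the decoder carries its own LSTM cell plus attention and output
# layers, all still trainable, which can compensate for the frozen encoder.
attn_cell = nn.LSTMCell(frame_dim + num_classes, 256)
attn_output = nn.Linear(256, num_classes)
```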

In addition, I used VGG-BiLSTM as the encoder and Attn as the decoder and froze the backbone network. I found that the performance was quite different from that of the original paper.

That's what I did too, but my results were comparable to the paper's (a bit worse). Also, I think what they report is word accuracy, not character accuracy.
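To be clear about the difference between the two metrics, roughly (my own sketch; the editdistance package is just one option, any edit-distance implementation works):

```python
import editdistance  # pip install editdistance

def word_accuracy(preds, gts):
    # exact match over whole words: one wrong character fails the whole word
    return sum(p == g for p, g in zip(preds, gts)) / len(gts)

def char_accuracy(preds, gts):
    # 1 - normalized character edit distance: partial credit per character
    errors = sum(editdistance.eval(p, g) for p, g in zip(preds, gts))
    total = sum(len(g) for g in gts)
    return 1 - errors / total
```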