open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

Training ABINET on Custom Dataset #1166


bely66 commented 2 years ago

Hi everyone, I'm using my own custom dataset to train ABINet on a non-Latin language. I went through the following steps (a rough sketch of the dataset config follows the list):

  1. Generated labels.txt for the data in the OCRDataset format.
  2. Added a config file for the new dataset in configs/_base_/rec_datasets.
  3. Changed the dataset path to the new config in abinet_academic.py in configs/rec/.
  4. Added dict_list with the new language's characters in configs/_base_/rec_models/abinet.py.
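For reference, a minimal sketch of such a dataset config in the mmocr 0.x style; the paths and loader settings here are assumptions for illustration, not the exact files:

```python
# Hypothetical dataset config following the mmocr 0.x OCRDataset convention.
# labels.txt is assumed to hold "<image name> <transcription>" lines.
dataset_type = 'OCRDataset'
data_root = 'data/my_lang'  # hypothetical location of the custom dataset

train = dict(
    type=dataset_type,
    img_prefix=f'{data_root}/imgs',
    ann_file=f'{data_root}/labels.txt',
    loader=dict(
        type='AnnFileLoader',
        repeat=1,
        file_format='txt',
        parser=dict(
            type='LineStrParser',
            keys=['filename', 'text'],
            keys_idx=[0, 1],
            separator=' ')),
    pipeline=None,
    test_mode=False)

train_list = [train]
```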

I ran into a problem because num_chars is hardcoded in the file:

[screenshot: the base config with num_chars hardcoded]

I changed it to the following:

[screenshot: the edited config]

I set it to len(dict_list) + 1 because training failed without the extra one.
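A sketch of the edit, assuming the mmocr 0.x base config layout (the dict_list contents here are illustrative, not the actual character set):

```python
# Illustrative character set for the new language.
dict_list = list('0123456789abcdefghijklmnopqrstuvwxyz')

# Originally hardcoded (num_chars = 37 for the default 36-char dict);
# the +1 is explained further down in this thread.
num_chars = len(dict_list) + 1

label_convertor = dict(
    type='ABIConvertor',
    dict_list=dict_list,
    with_unknown=False,
    with_padding=False,
    lower=False)
```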

I have two questions:

  1. Did I miss any steps that could make the training go wrong or not progress?
  2. I'm training on a single V100 GPU. In a test with a small dataset of 10k samples, 20 epochs will take about a day. Isn't that a very long time?
gaotongxiao commented 2 years ago
  1. The steps are fine. Currently num_chars has to be hardcoded, but this issue will be eliminated in our upcoming release.
  2. ABINet is huge; it took us 5 days to finish training on 2x A100 GPUs.
bely66 commented 2 years ago

So num_chars should be that exact number? I ask because I have to add 1 for the training to run.

gaotongxiao commented 2 years ago

@bely66 That's right. The extra character comes from the special "<BOS/EOS>" token.
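A toy illustration of the indexing this implies (not mmocr's actual converter code, where the exact ordering may differ): characters occupy indices 0 to len(dict_list) - 1, and one extra class index is reserved for the <BOS/EOS> token, hence the +1:

```python
# Toy sketch of the class-index layout implied by num_chars = len(dict_list) + 1.
dict_list = ['a', 'b', 'c']       # toy character set
char2idx = {c: i for i, c in enumerate(dict_list)}
bos_eos_idx = len(dict_list)      # 3 -> reserved for <BOS/EOS>
num_chars = len(dict_list) + 1    # 4 output classes in total
print(char2idx, bos_eos_idx, num_chars)
```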

e4s2022 commented 1 year ago

Hello there.

I also tried to train ABINet (the vision-only model) on a custom dataset. The dataset contains around 27k training samples and 5k test samples. The characters to predict include digits (0-9), letters (a-z, A-Z), Chinese characters, and some punctuation, about 5K characters in total. That is to say, the output dimension of the last linear layer is ~5K.
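A sketch of how such a large character set might be loaded into dict_list; the chars.txt file (one character per line, ~5K lines) is hypothetical:

```python
# Load a large character dictionary from a hypothetical one-char-per-line file.
with open('chars.txt', encoding='utf-8') as f:
    dict_list = [line.rstrip('\n') for line in f if line.strip()]

num_chars = len(dict_list) + 1  # output dim of the final linear layer
print(f'{len(dict_list)} characters -> classifier dimension {num_chars}')
```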

I trained ABINet using the default configuration for 20 epochs. The metrics are quite low compared with the reference model.

My results at the 19th epoch:

2022-08-29 17:24:01,492 - mmocr - INFO - Epoch(val) [19][4769]  
0_char_recall: 0.1754, 
0_char_precision: 0.2006, 
0_word_acc: 0.0315, 
0_word_acc_ignore_case: 0.0317, 
0_word_acc_ignore_case_symbol: 0.0323, 
0_1-N.E.D: 0.1470

My loss is always hovering around 4, while the reference loss value is around 0.2.

Friends, do you have any updates on improving this, especially for Chinese text recognition?