We construct a new benchmark for Hangul OCR, a task with an intractably large number of classes. The proposed benchmark exposes the class-imbalance and target-class selection issues in Hangul OCR.
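As a rough illustration of why the class count is so large, the sketch below uses the standard Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3), where every syllable is a combination of 19 initial consonants, 21 vowels, and 28 final slots (including "no final"). The function names `decompose` and `compose` are hypothetical helpers introduced here for illustration. This is plain Unicode arithmetic and is not necessarily the decomposition scheme used in the paper.

```python
from typing import Tuple

# Unicode layout of precomposed Hangul syllables (U+AC00..U+D7A3).
HANGUL_BASE = 0xAC00
NUM_INITIAL = 19  # leading consonants
NUM_MEDIAL = 21   # vowels
NUM_FINAL = 28    # trailing consonants; index 0 means "no final"

# Total number of precomposed syllables: 19 * 21 * 28.
NUM_SYLLABLES = NUM_INITIAL * NUM_MEDIAL * NUM_FINAL


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Return (initial, medial, final) indices of one precomposed syllable.

    Hypothetical helper for illustration only.
    """
    if len(syllable) != 1:
        raise ValueError("expected a single character")
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def compose(initial: int, medial: int, final: int) -> str:
    """Inverse of decompose: rebuild the syllable from its three indices."""
    return chr(HANGUL_BASE + (initial * NUM_MEDIAL + medial) * NUM_FINAL + final)


if __name__ == "__main__":
    print(NUM_SYLLABLES)
    print(decompose("한"))
```

Treating each syllable as one class gives `NUM_SYLLABLES` possible targets, whereas predicting the three indices separately needs only 19 + 21 + 28 outputs. That trade-off is the general idea behind decomposition-based approaches.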
Download Link
| Name | Link | Description |
| --- | --- | --- |
| AI Hub | train-set: Website* | It consists of about 100,000 images of Hangul characters. A total of 674,110 text areas are extracted to evaluate character recognition performance. Of these, 10,000 are held out as the test set, and the rest are used as training data. |
| AI Hub | test-set: Google Drive | - |
| MLT-h | test-set: Google Drive | The MLT dataset was introduced at ICDAR to address multi-lingual text detection and script identification. We use only the Hangul text regions in the MLT17 test-set for evaluation and name this subset MLT-h. We found many annotation errors in this dataset and corrected those noisy labels. |
| SFW | test-set: Google Drive | To emphasize the class-imbalance problem in Korean character recognition, we synthesized a new dataset containing a large number of minority classes using SynthTiger. The dataset contains a total of 18,831 standard foreign words registered with the National Institute of the Korean Language. |
| Unseen Characters | test-set: Google Drive | To evaluate robustness to unseen characters, we selected 72 characters in SFW that cannot be represented with a common character encoding and generated one image per character. |
*The AI Hub train-set must be downloaded from the official website. We cropped text regions for training, and this cropped dataset will be available soon.
Sample images
AI Hub
MLT-h
SFW
Unseen Characters
Citation
Our paper was accepted to the ECCV 2022 TiE workshop.
@article{kim2022character,
  title={Character decomposition to resolve class imbalance problem in Hangul OCR},
  author={Kim, Geonuk and Son, Jaemin and Lee, Kanghyu and Min, Jaesik},
  journal={arXiv preprint arXiv:2208.06079},
  year={2022}
}