mandal4/HangulNet

Benchmark data for Hangul OCR

We construct a new benchmark for Hangul OCR, a task with an intractably large number of classes. The proposed benchmark exposes the class-imbalance and target-class selection issues in Hangul OCR.
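To give a sense of the class count, here is a minimal sketch based on standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3). It is an illustration only: the README does not state which class set the authors use, and the `decompose` helper is a hypothetical name, not code from this repository. It shows that 11,172 syllable classes factor into 19 initial, 21 medial, and 28 final components (the final set includes "no final consonant").

```python
from typing import Tuple

# Standard Unicode layout of precomposed Hangul syllables (not the authors' code).
HANGUL_BASE = 0xAC00
NUM_INITIAL, NUM_MEDIAL, NUM_FINAL = 19, 21, 28  # final includes "no final consonant"
NUM_SYLLABLES = NUM_INITIAL * NUM_MEDIAL * NUM_FINAL  # 11,172 possible syllable classes


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Return (initial, medial, final) indices for one precomposed Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


if __name__ == "__main__":
    print(NUM_SYLLABLES)    # 11172
    print(decompose("한"))  # (18, 0, 4)
```

Predicting three small component labels instead of one label out of 11,172 is one common way to sidestep rare syllable classes; whether it matches the approach in the cited paper should be checked against the paper itself.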

Download Link

| Name | Link | Description |
| --- | --- | --- |
| AI Hub train-set | Website* | It consists of about 100,000 images of Hangul characters. A total of 674,110 text areas are extracted to evaluate the performance of character recognition. Of these, 10,000 are separated into the test set, and the rest are used as training data. |
| AI Hub test-set | Google Drive | - |
| MLT-h test-set | Google Drive | The MLT dataset was introduced at ICDAR to address multi-lingual text detection and script identification. We use only the Hangul text regions in the MLT17 test-set for evaluation and name this subset MLT-h. We found many annotation errors in this dataset and rectified those noisy labels. |
| SFW test-set | Google Drive | To emphasize the class-imbalance problem in Korean character recognition, we synthesized a new dataset containing a large number of minority classes using SynthTiger. The dataset contains a total of 18,831 standard foreign words registered with the National Institute of Korean Language. |
| Unseen Characters test-set | Google Drive | To evaluate robustness on unseen characters, we selected 72 characters in SFW that cannot be represented with a common character encoding and generated one image per character. |

*The AI Hub train-set must be downloaded from the official website. We cropped text regions for training; this cropped dataset will be made available soon.

Sample images

* `AI Hub`

* `MLT-h`

* `SFW`

* `Unseen Characters`

Citation

Our paper was accepted to the ECCV 2022 TiE workshop.

@article{kim2022character,
  title={Character decomposition to resolve class imbalance problem in Hangul OCR},
  author={Kim, Geonuk and Son, Jaemin and Lee, Kanghyu and Min, Jaesik},
  journal={arXiv preprint arXiv:2208.06079},
  year={2022}
}
