hiyali / pilgen

The aim of this repository is to generate datasets (image & its label) for OCR training.

How can I use my own corpus to generate images? #2

Closed: damengdameng closed this issue 3 years ago

damengdameng commented 3 years ago

In the code, words are generated randomly and then rendered into pictures. How can I use a fixed corpus to generate the data instead? Thanks.

hiyali commented 3 years ago

You can supply a word from your own corpus here: L121.
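
For readers who want to feed a fixed word list into that spot, here is a minimal sketch. It is not part of pilgen: `load_corpus`, `pick_word`, and the file name `corpus.txt` are hypothetical, and it assumes a UTF-8 text file with one word per line.

```python
import random
from pathlib import Path


def load_corpus(path):
    """Return the non-empty, stripped lines of a UTF-8 text file (one word per line)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def pick_word(words, rng=random):
    """Pick one word at random; pass a seeded random.Random for reproducible runs."""
    return rng.choice(words)


# Hypothetical usage at the place where the random word is currently generated:
#   words = load_corpus("corpus.txt")   # load once, outside the generation loop
#   word = pick_word(words)
```

Loading the file once and sampling from the list keeps the rest of the generation loop unchanged; to cover every word exactly once, iterate over `words` instead of sampling.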

damengdameng commented 3 years ago

Thank you for the reply.

I tried to plug in a word from my own corpus like this:

    word = 'جىنپىڭ'
    put_word = ''.join(reversed(word))  # reversed copy of the word, to be drawn onto the image
    font = ImageFont.truetype(get_rand_font(), get_rand_font_size(len(put_word)))
    size = font.getsize(put_word)

But the letters in the picture come out separated instead of joined. It seems that the Uyghur text should first be converted to Latin letters through the uly_char_map:

uly_char_map = {
    'ﺎﺋ': { 'Type': 'vowel', 'Latin': ['a', 'A'] },
    'ﺏ':  { 'Type':  None  , 'Latin': ['b', 'B'] },
    'ﭺ':  { 'Type':  None  , 'Latin': ['ch', 'Ch'] },
    'ﺩ':  { 'Type':  None  , 'Latin': ['d', 'D'] },
    'ﻪﺋ': { 'Type': 'vowel', 'Latin': ['e', 'E'] },
    'ﯥﺋ': { 'Type': 'vowel', 'Latin': ['é', 'É'] },
    'ﻑ':  { 'Type':  None  , 'Latin': ['f', 'F'] },
    'ﻍ':  { 'Type':  None  , 'Latin': ['g', 'G'] },
    'ﮒ':  { 'Type':  None  , 'Latin': ['gh', 'Gh'] },
    'ﮪ':  { 'Type':  None  , 'Latin': ['h', 'H'] },
    'ﻰﺋ': { 'Type': 'vowel', 'Latin': ['i', 'I'] },
    'ﺝ':  { 'Type':  None  , 'Latin': ['j', 'J'] },
    'ك':  { 'Type':  None  , 'Latin': ['k', 'K'] },
    'ل':  { 'Type':  None  , 'Latin': ['l', 'L'] },
    'م':  { 'Type':  None  , 'Latin': ['m', 'M'] },
    'ن':  { 'Type':  None  , 'Latin': ['n', 'N'] },
    'ڭ':  { 'Type':  None  , 'Latin': ['ng', 'Ng'] },
    'ﻮﺋ': { 'Type': 'vowel', 'Latin': ['o', 'O'] },
    'ﯚﺋ': { 'Type': 'vowel', 'Latin': ['ö', 'Ö'] },
    'پ':  { 'Type':  None  , 'Latin': ['p', 'P'] },
    'ق':  { 'Type':  None  , 'Latin': ['q', 'Q'] },
    'ر':  { 'Type':  None  , 'Latin': ['r', 'R'] },
    'س':  { 'Type':  None  , 'Latin': ['s', 'S'] },
    'ش':  { 'Type':  None  , 'Latin': ['sh', 'Sh'] },
    'ت':  { 'Type':  None  , 'Latin': ['t', 'T'] },
    'ﯘﺋ': { 'Type': 'vowel', 'Latin': ['u', 'U'] },
    'ﯜﺋ': { 'Type': 'vowel', 'Latin': ['ü', 'Ü'] },
    # v
    'ۋ':  { 'Type':  None  , 'Latin': ['w', 'W'] },
    'خ':  { 'Type':  None  , 'Latin': ['x', 'X'] },
    'ي':  { 'Type':  None  , 'Latin': ['y', 'Y'] },
    'ز':  { 'Type':  None  , 'Latin': ['z', 'Z'] },
    'ژ':  { 'Type':  None  , 'Latin': ['zh', 'Zh'] }
}

But the Uyghur characters I got from here [https://github.com/JaidedAI/EasyOCR/blob/master/easyocr/character/ug_char.txt] are completely different from the ones in uly_char_map, and some keys in uly_char_map seem to be composed of two letters. Can you give some suggestions?

hiyali commented 3 years ago

Here is your answer.

    # from lang.ug.util.convert import br_2_pf
    word = br_2_pf('جىنپىڭ')
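
A note on why the two character sets looked different. The name `br_2_pf` suggests a conversion from the basic Arabic range to presentation forms; that reading is an assumption, not something confirmed in this thread. Under that assumption, the sketch below uses only Python's standard `unicodedata` module to check which code points a string contains and to fold presentation forms back to the basic range for comparison against a list such as `ug_char.txt`. The helper names are hypothetical and not part of pilgen.

```python
import unicodedata


def is_presentation_form(ch):
    """True if ch lies in Arabic Presentation Forms-A (U+FB50..U+FDFF) or -B (U+FE70..U+FEFF)."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF


def describe(text):
    """List (code point, Unicode name, is_presentation_form) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), is_presentation_form(ch))
        for ch in text
    ]


def to_basic_range(text):
    """Fold presentation forms back to basic-range letters via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# U+FE8F (BEH, isolated presentation form) folds back to U+0628 (basic BEH).
assert to_basic_range("\uFE8F") == "\u0628"
```

Running `describe` on a key from `uly_char_map` and on a line from `ug_char.txt` shows whether each side uses presentation forms or basic-range letters, and whether a two-character key is really a pair of code points.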