tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Error generating images with text2image using khm.training_text #372

Closed KaiNKaiHo closed 7 months ago

KaiNKaiHo commented 7 months ago

Hi everyone! I have only recently started using Tesseract and I want to train it on the Khmer language. I tried to generate images with text2image, but all I got were blank images. Here is the part of my code where I think the problem is:


import os
import pathlib
import random
import subprocess

training_text_file = 'tesstrain/data/khm/khm.training_text'

lines = []

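# Read the Khmer text as UTF-8 explicitly instead of relying on the platform default.
with open(training_text_file, 'r', encoding='utf-8') as input_file: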
    for line in input_file.readlines():
        lines.append(line.strip())

output_directory = 'tesstrain/data/Test'

if not os.path.exists(output_directory):
    os.mkdir(output_directory)

random.shuffle(lines)

count = 100

lines = lines[:count]

line_count = 0
for line in lines:
    training_text_file_name = pathlib.Path(training_text_file).stem
    line_training_text = os.path.join(output_directory, f'{training_text_file_name}_{line_count}.gt.txt')
    with open(line_training_text, 'w', encoding='utf-8') as output_file:
        output_file.writelines([line])

    file_base_name = f'khm_{line_count}'

    subprocess.run([
        'text2image',
        '--font=DejaVu Sans',
        f'--text={line_training_text}',
        f'--outputbase={output_directory}/{file_base_name}',
        '--max_pages=1',
        '--leading=32',
        '--xsize=3600',
        '--ysize=480',
        '--char_spacing=1.0',
        '--exposure=0',
        '--unicharset_file=tesstrain/data/khm/Khmer.unicharset'
    ])

    line_count += 1

zdenop commented 7 months ago

Please do not post your custom code. We have no resources to fix your mistakes. Report only errors in the official training process. Errors/bugs in the training utilities should be reported at https://github.com/tesseract-ocr/tesseract/issues
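
For anyone hitting the same blank-image symptom, two checks may help narrow it down. This is a sketch, not a confirmed diagnosis of this issue. The script in the question ignores text2image's exit code and messages, and the chosen font (DejaVu Sans) may not contain Khmer glyphs. The helper names `build_font_check_command` and `run_and_report`, and the `/usr/share/fonts` directory, are assumptions made for illustration. The `--find_fonts`, `--fonts_dir` and `--min_coverage` options are text2image flags.

```python
import subprocess
import sys


def build_font_check_command(text_file, fonts_dir, output_base):
    """Build a text2image call that searches fonts_dir for fonts able to render text_file."""
    return [
        'text2image',
        '--find_fonts',
        f'--fonts_dir={fonts_dir}',
        f'--text={text_file}',
        f'--outputbase={output_base}',
        '--min_coverage=1.0',  # only report fonts that cover every character
    ]


def run_and_report(command):
    """Run a command and return (exit_code, combined stdout and stderr)."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep warnings and errors with normal output
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == '__main__':
    # Paths below are assumptions; adjust them to the local setup.
    command = build_font_check_command(
        'tesstrain/data/khm/khm.training_text',
        '/usr/share/fonts',
        'tesstrain/data/Test/fontcheck',
    )
    exit_code, output = run_and_report(command)
    print(f'text2image exited with code {exit_code}')
    print(output)
```

If no installed font is reported as covering the text, passing a Khmer-capable font name to `--font` (with `--fonts_dir` pointing at its location) would be the next thing to try. The same `run_and_report` helper can wrap the original rendering call so that any warnings from text2image become visible.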