tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Tesseract does not generate .lstmf for some images #3386

Open DavidHribek opened 3 years ago

DavidHribek commented 3 years ago

Hello,

I want to train Tesseract on my own images (text lines). I copied all my images and texts into one folder (image_name.[png/jpg], image_name.gt.txt).

Now I run this command:

make training MODEL_NAME=eng \
    START_MODEL=eng \
    MAX_ITERATIONS=100 \
    PSM=7 \
    TESSDATA=/path/to/tessdata \
    GROUND_TRUTH_DIR=/path/to/folder/with/images/and/texts

It produces .lstmf, .box, and .txt files for every image in the folder, with the message:

Then the training starts.

My problem: For some images, the second command does not produce the .lstmf, .box, and .txt files, but Tesseract keeps waiting for them, so training never starts.

Thanks for help.

stweil commented 3 years ago

Our problem: we don't know your images, but need one of those which don't work.

DavidHribek commented 3 years ago

17_29_1617884209_8154147

For example, Tesseract never generates the .lstmf, .box, and .txt files from this image.

stweil commented 3 years ago

Thanks, confirmed.

stweil commented 3 years ago

There is an (unrelated) bug in src/ccmain/pagesegmain.cpp: read_unlv_file is called with a buggy name argument ("114025495-7830ed00-9875-11eb-8889-a9c9ea4003a7\000png.uzn").

DavidHribek commented 3 years ago

I tried to run training with fewer images, including the "Benesov" image mentioned above. The .lstmf file was not generated for the "Benesov" image; it was skipped and training started. If I run training with all my images, training never starts.
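To see which images were skipped, one option is to compare the image files in the ground-truth folder against the generated .lstmf files. The following is a minimal sketch, not part of Tesseract or its training tools. It assumes the naming used in this thread (image_name.png next to image_name.lstmf); the function name `images_missing_lstmf` and the suffix list are hypothetical choices made for illustration.

```python
from pathlib import Path

# Assumption: these are the image extensions used in the ground-truth folder.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def images_missing_lstmf(ground_truth_dir):
    """Return the sorted image paths that have no matching .lstmf file."""
    folder = Path(ground_truth_dir)
    missing = []
    for image in sorted(folder.iterdir()):
        # Skip .gt.txt, .box, .lstmf and any other non-image files.
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Assumed naming: image_name.png -> image_name.lstmf
        if not image.with_suffix(".lstmf").exists():
            missing.append(image)
    return missing


# Example use (the path is the placeholder from the command above):
# for path in images_missing_lstmf("/path/to/folder/with/images/and/texts"):
#     print(path.name)
```

The images it lists are the ones to inspect or preprocess before restarting training.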

stweil commented 3 years ago

Tesseract does not find a text box in this image. It tries to find lines, finds none (why?), and also does not fall back to --psm 7 and accept the whole image as a single line.

Call stack:

#0  tesseract::line_edges (x=0, y=87, xext=210, uppercolour=1 '\001', bwpos=0x611000006a80 "", prevline=0x61c000000080, free_cracks=0x7fffffff88c0, outline_it=0x7fffffff8ce0)
    at ../../../src/textord/scanedg.cpp:185
#1  0x0000000001ca7615 in tesseract::block_edges (t_pix=..., block=0x60f000000138, outline_it=0x7fffffff8ce0) at ../../../src/textord/scanedg.cpp:99
#2  0x0000000001c5c805 in tesseract::extract_edges (pix=..., block=0x60f000000130) at ../../../src/textord/edgblob.cpp:330
#3  0x0000000001ebf79b in tesseract::Textord::find_components (this=0x7ffff183e5f0, pix=..., blocks=0x6020000064d0, to_blocks=0x7fffffff9c80)
    at ../../../src/textord/tordmain.cpp:224
#4  0x0000000001e9b2ef in tesseract::Textord::TextordPage (this=0x7ffff183e5f0, pageseg_mode=tesseract::PSM_SINGLE_LINE, reskew=..., width=210, height=88, binary_pix=..., 
    thresholds_pix=..., grey_pix=..., use_box_bottoms=false, diacritic_blobs=0x7fffffff9c60, blocks=0x6020000064d0, to_blocks=0x7fffffff9c80)
    at ../../../src/textord/textord.cpp:185
#5  0x0000000000b34cca in tesseract::Tesseract::SegmentPage (this=0x7ffff181a800, input_file=0x6060000033e0 "114025495-7830ed00-9875-11eb-8889-a9c9ea4003a7.png", 
    blocks=0x6020000064d0, osd_tess=0x0, osr=0x7fffffffa3a0) at ../../../src/ccmain/pagesegmain.cpp:172
#6  0x000000000058fc0d in tesseract::TessBaseAPI::FindLines (this=0x7fffffffdb40) at ../../../src/api/baseapi.cpp:2187
#7  0x0000000000591ea9 in tesseract::TessBaseAPI::Recognize (this=0x7fffffffdb40, monitor=0x0) at ../../../src/api/baseapi.cpp:837
#8  0x00000000005a12f9 in tesseract::TessBaseAPI::ProcessPage (this=0x7fffffffdb40, pix=0x606000003380, page_index=0, 
    filename=0x7fffffffe668 "114025495-7830ed00-9875-11eb-8889-a9c9ea4003a7.png", retry_config=0x0, timeout_millisec=0, renderer=0x0) at ../../../src/api/baseapi.cpp:1254
#9  0x00000000005a71fe in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x7fffffffdb40, filename=0x7fffffffe668 "114025495-7830ed00-9875-11eb-8889-a9c9ea4003a7.png", 
    retry_config=0x0, timeout_millisec=0, renderer=0x0) at ../../../src/api/baseapi.cpp:1217
#10 0x00000000005a3102 in tesseract::TessBaseAPI::ProcessPages (this=0x7fffffffdb40, filename=0x7fffffffe668 "114025495-7830ed00-9875-11eb-8889-a9c9ea4003a7.png", 
    retry_config=0x0, timeout_millisec=0, renderer=0x0) at ../../../src/api/baseapi.cpp:1070
#11 0x00000000004e635a in main (argc=6, argv=0x7fffffffe398) at ../../../src/api/tesseractmain.cpp:782

stweil commented 3 years ago

@DavidHribek, you could try removing the left black border from the image. The image can also be cropped below the text. Maybe one of those modifications might fix the problem.

amitdo commented 3 years ago

Tesseract in non training mode will also fail.

This is a known issue.

Try this command (ImageMagick):

convert img1.png -bordercolor White -border 10x10 img2.png
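
For a whole training folder, the same ImageMagick call can be repeated per image. The sketch below is one way to do that. It assumes `convert` is on the PATH and that ground-truth files end in .gt.txt as described earlier in this thread; the helper names `border_command` and `pad_images` are hypothetical and not part of Tesseract or ImageMagick.

```python
import shutil
import subprocess
from pathlib import Path


def border_command(src, dst, border=10, color="White"):
    """Build the argument list for the ImageMagick command shown above."""
    return ["convert", str(src), "-bordercolor", color,
            "-border", f"{border}x{border}", str(dst)]


def pad_images(src_dir, dst_dir, border=10):
    """Write bordered copies of all images, plus their .gt.txt files, to dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(src_dir.iterdir()):
        if path.suffix.lower() in {".png", ".jpg"}:
            # Runs ImageMagick; raises CalledProcessError if convert fails.
            subprocess.run(border_command(path, dst_dir / path.name, border),
                           check=True)
        elif path.name.endswith(".gt.txt"):
            # Keep the transcription next to the padded image.
            shutil.copy2(path, dst_dir / path.name)


# Example use (both paths are placeholders):
# pad_images("/path/to/folder/with/images/and/texts", "/path/to/padded")
```

Writing to a separate folder leaves the original images untouched, so GROUND_TRUTH_DIR can be pointed at the padded copies and switched back if the border does not help.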