tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0
630 stars 184 forks

Failed to read boxes #304

Open NoxideLive opened 2 years ago

NoxideLive commented 2 years ago

I am trying to use the tool and just run the tutorial setup.

However, when running make training I get an error:

Failed to read boxes from data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif

This also happens with my custom setup text

As a note, I am running this in WSL on Windows 10, and it is the tesseract command that produces the error.

51yu commented 2 years ago

Any update on this? I ran into a similar issue:

make training
+ tesseract data/foo-ground-truth/fontane_irrungen_1888_0258_008.tif data/foo-ground-truth/fontane_irrungen_1888_0258_008 --psm 13 lstm.train
Failed to read boxes from data/foo-ground-truth/fontane_irrungen_1888_0258_008.tif
Error during processing.
make: *** [data/foo-ground-truth/fontane_irrungen_1888_0258_008.lstmf] Error 1

tesseract build version 5.1.0-32-gf36c0, macOS

pannich commented 2 years ago

Same here.

make training MODEL_NAME=tha2
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/tha2-ground-truth/8-th.png" -t "data/tha2-ground-truth/8-th.gt.txt" > "data/tha2-ground-truth/8-th.box"
+ tesseract data/tha2-ground-truth/8-th.png data/tha2-ground-truth/8-th --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/tha2-ground-truth/5-th.png" -t "data/tha2-ground-truth/5-th.gt.txt" > "data/tha2-ground-truth/5-th.box"
+ tesseract data/tha2-ground-truth/5-th.png data/tha2-ground-truth/5-th --psm 13 lstm.train
+ tesseract data/tha2-ground-truth/3-th.png data/tha2-ground-truth/3-th --psm 13 lstm.train
Failed to read boxes from data/tha2-ground-truth/3-th.png
Error during processing.
make: *** [data/tha2-ground-truth/3-th.lstmf] Error 1

It works with your sample data (ocrd) and also with some of my images, but it does not work with the '3-th' image that triggered the error here.

tesseract 5.1.0-72-gb8b6 , leptonica-1.82.0 , libgif 5.2.1 : libjpeg 9e : libpng 1.6.37 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.5.0 , Found SSE4.1 Found libarchive 3.6.1 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2 Found libcurl/7.77.0 SecureTransport (LibreSSL/2.8.3) zlib/1.2.11 nghttp2/1.42.0 Mac OS m1

Resolved: I managed to fix this by inspecting the ground truth files. If I remember correctly, a ground truth .txt file cannot be empty.
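A quick way to check a whole ground truth directory for this condition is sketched below. It is a minimal sketch, not part of tesstrain: the helper name find_blank_gt_files is made up for illustration, and "blank" here means empty after str.strip(), so invisible characters that Python does not treat as whitespace would not be caught.

```python
import pathlib
import tempfile


def find_blank_gt_files(gt_dir):
    """Return sorted paths of *.gt.txt files that contain no non-whitespace text.

    Hypothetical helper for illustration; it is not part of tesstrain.
    """
    blank = []
    for path in sorted(pathlib.Path(gt_dir).glob("*.gt.txt")):
        # errors="replace" keeps the scan going even if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        if not text.strip():
            blank.append(path)
    return blank


if __name__ == "__main__":
    # Demo on a throwaway directory; pass your own ground truth directory instead.
    with tempfile.TemporaryDirectory() as gt_dir:
        root = pathlib.Path(gt_dir)
        (root / "good.gt.txt").write_text("some text\n", encoding="utf-8")
        (root / "empty.gt.txt").write_text("", encoding="utf-8")
        (root / "spaces.gt.txt").write_text(" \n\t\n", encoding="utf-8")
        for path in find_blank_gt_files(root):
            print(path.name)
```

Running the check before make training should point at the line images whose ground truth needs fixing or removing.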

stefanCCS commented 2 years ago

I have had the same issue. I started make with: make GROUND_TRUTH_DIR=<myGTDir> MODEL_NAME=<mymodelname> lists

The image which creates the error: OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001 bin

Corresponding box file: boxfile.zip

Full error message on console:

PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tessadmin/tesstrain/data/cyrillicEvalPartly/train/OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001.bin.png" -t "/home/tessadmin/tesstrain/data/cyrillicEvalPartly/train/OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001.gt.txt" > "/home/tessadmin/tesstrain/data/cyrillicEvalPartly/train/OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001.box"
+ tesseract /home/tessadmin/tesstrain/data/cyrillicEvalPartly/train/OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001.bin.png /home/tessadmin/tesstrain/data/cyrillicEvalPartly/train/OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001 --psm 13 lstm.train
Failed to read boxes from /home/tessadmin/tesstrain/data/cyrillicEvalPartly/train/OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001.bin.png
Error during processing.
make: *** [Makefile:225: /home/tessadmin/tesstrain/data/cyrillicEvalPartly/train/OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001.lstmf] Error 1

pannich commented 2 years ago

Hey, I managed to solve my issue by inspecting the ground truth .txt file. What's in your OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001.gt.txt file?

stefanCCS commented 2 years ago

Many thanks - it looks like you are right. My GT file is effectively empty (it has 12 bytes, but no visible text). See this file here: OCR-D-SEG-LINE-CCS-IMG-BL-4792_007818296_00769_TR-5_TR-5_line0001.gt.txt.zip
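When a file has a non-zero size but shows no visible text, listing its code points can reveal what is actually in it. The sketch below is only an illustration: describe_code_points is a made-up helper, and the sample string is invented, not the content of the attached file.

```python
import unicodedata


def describe_code_points(text):
    """Return one 'U+XXXX NAME' string per character in text."""
    return [
        # Control characters such as line breaks have no Unicode name,
        # so a placeholder is used for them.
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# Invented sample: a zero width space, a no-break space, and a line break.
sample = "\u200b\u00a0\n"
for entry in describe_code_points(sample):
    print(entry)
```

For a real file, the text could be obtained with pathlib.Path(path).read_text(encoding="utf-8") and passed to the same function.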

zdenop commented 1 year ago

The problem is that, e.g., https://github.com/tesseract-ocr/tesstrain/blob/main/generate_line_box.py prints its output to stdout, so the Makefile creates the box file even when there is no data. @kba, @stweil: Is there any tool that processes box data from stdout? IMO this functionality should be rewritten so that the box file is written directly, and only when there is real data.

bertsky commented 7 months ago

@zdenop I am trying to understand. So the GT text file is bogus, but not to the extent that generate_line_box.py would raise an exception and thus return a non-zero exit value, correct? And you want to prevent empty output, correct?

In that case, I would suggest simply changing the Python script to exit with a non-zero status if the condition (if line) cannot be met.
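The suggestion could look roughly like the sketch below. It is not the actual generate_line_box.py: first_gt_line and main are hypothetical names, and the real box output is replaced by a placeholder print.

```python
import sys


def first_gt_line(lines):
    """Return the first non-blank line, stripped, or None if there is none."""
    for raw in lines:
        line = raw.strip()
        if line:
            return line
    return None


def main(gt_path):
    """Return a process exit status: 0 on success, 1 if the GT file has no text."""
    with open(gt_path, encoding="utf-8") as gt_file:
        line = first_gt_line(gt_file)
    if line is None:
        # Diagnostics go to stderr so they never end up in a redirected box file.
        print(f"No usable text in {gt_path}", file=sys.stderr)
        return 1
    print(line)  # placeholder for the real box output
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

One caveat: when the output is redirected with > in a shell recipe, the shell creates the target file before the script starts, so an empty box file can still be left behind after a failed run. Whether the Makefile removes it is not covered in this thread.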

zdenop commented 7 months ago

I have already started writing my own tools to replace some of the Makefile functionality. I have the following function, which handles empty files correctly:

import pathlib
import unicodedata
from PIL import Image

red = '\033[91m'
reset = '\033[0m'

def generate_line_box(gt_txt, image_path, output_path):
    """Creates tesseract box files for given (line) image text pairs"""
    lines = pathlib.Path(gt_txt).read_text(encoding='utf-8').splitlines()
    # An empty file yields no lines; files with more than one line are rejected too.
    if len(lines) != 1:
        print(f"{red}Invalid gt_txt file: {gt_txt}{reset}")
        return False
    line = unicodedata.normalize('NFC', lines[0].strip())
    # A whitespace-only line is empty after strip(), so no box file is written.
    if not line:
        print(f"{red}Empty line in gt_txt file: {gt_txt}{reset}")
        return False
    with Image.open(image_path) as image:
        width, height = image.size
    # The output file is opened only after the ground truth has been validated.
    with open(output_path, 'w', encoding='utf-8') as out_file:
        for i in range(1, len(line)):
            char = line[i]
            prev_char = line[i-1]
            if unicodedata.combining(char):
                # Keep a combining mark together with the character before it.
                out_file.write(f"{prev_char + char} 0 0 {width} {height} 0\n")
            elif not unicodedata.combining(prev_char):
                out_file.write(f"{prev_char} 0 0 {width} {height} 0\n")
        if not unicodedata.combining(line[-1]):
            out_file.write(f"{line[-1]} 0 0 {width} {height} 0\n")
        # A tab entry marks the end of the line.
        out_file.write(f"\t 0 0 {width} {height} 0\n")
    return True
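To see which symbols the loop above emits without needing an image or Pillow, the same iteration can be reduced to a pure function. This is only an illustrative sketch: box_symbols is a made-up name, and it returns the symbols instead of writing box lines.

```python
import unicodedata


def box_symbols(line):
    """Return the symbols that would each get one box line, in order."""
    symbols = []
    for i in range(1, len(line)):
        char, prev_char = line[i], line[i - 1]
        if unicodedata.combining(char):
            # A combining mark is emitted together with the preceding character.
            symbols.append(prev_char + char)
        elif not unicodedata.combining(prev_char):
            symbols.append(prev_char)
    # The last character is emitted on its own unless it is a combining mark.
    if line and not unicodedata.combining(line[-1]):
        symbols.append(line[-1])
    return symbols


# "q" followed by COMBINING ACUTE ACCENT has no precomposed form,
# so the pair stays together as one symbol.
print(box_symbols("q\u0301x"))
```

Every returned symbol would be written with the same full-image bounding box (0 0 width height), followed by the final tab entry.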