Closed songzy12 closed 4 years ago
hi,
I was trying to train a new tesseract model with my own images and labeled txt files. I have been following this part of the tesseract-4.0 doc: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#making-box-files
It says:
For example, tesseract image.png image lstmbox will generate a box file with name image.box for the image in the current directory.
tesseract image.png image lstmbox
However, it failed like:
➜ data git:(master) tesseract -v
tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
➜ data git:(master) tesseract 2020-03-14-02-41-41-captcha.png 2020-03-14-02-41-41-captcha lstmbox
read_params_file: Can't open lstmbox
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
➜ data git:(master) ✗ ls
2020-03-14-02-41-41-captcha.png  2020-03-14-02-43-05-captcha.png  2020-03-14-02-43-23-result.txt
2020-03-14-02-41-41-captcha.txt  2020-03-14-02-43-05-result.txt   2020-03-14-02-43-39-captcha.png
2020-03-14-02-41-41-result.txt   2020-03-14-02-43-23-captcha.png  2020-03-14-02-43-39-result.txt
On the other hand, I found the following script: https://github.com/tesseract-ocr/tesstrain/blob/master/generate_line_box.py
which works well with
python3 tesstrain/generate_line_box.py --image=data/2020-03-14-02-41-41-captcha.png --txt=data/2020-03-14-02-41-41-captcha.txt > data/2020-03-14-02-41-41-captcha.box
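To cover every image in the directory rather than one file at a time, the same call could be looped roughly as below. This is only a sketch: the helper names box_jobs and run_jobs are made up for illustration, the script path is assumed to be tesstrain/generate_line_box.py relative to the working directory, and each .png is assumed to have its label in a .txt file with the same base name. Only the --image and --txt options shown in the command above are used.

```python
import subprocess
import sys
from pathlib import Path


def box_jobs(data_dir, script="tesstrain/generate_line_box.py"):
    """Return (argv, box_path) for every PNG that has a matching .txt label.

    The script path and the image/label naming are assumptions taken from
    the single-file command above, not a documented interface.
    """
    jobs = []
    for image in sorted(Path(data_dir).glob("*.png")):
        label = image.with_suffix(".txt")
        if not label.exists():
            continue  # skip images without a transcription file
        argv = [sys.executable, script, f"--image={image}", f"--txt={label}"]
        jobs.append((argv, image.with_suffix(".box")))
    return jobs


def run_jobs(jobs):
    """Run each command, redirecting its stdout into the .box file."""
    for argv, box_path in jobs:
        with open(box_path, "w", encoding="utf-8") as out:
            subprocess.run(argv, stdout=out, check=True)
```

Calling run_jobs(box_jobs("data")) from the repository root would then write one .box file next to each labeled image. Building the command list separately from running it makes it easy to print and inspect the commands first.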
Maybe the current doc about how to train tesseract is outdated now?
4.0.0-beta.1 is an old (very old) version. Use a recent version of tesseract.
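A quick way to spot this situation before training is to check the version string that `tesseract -v` prints. The snippet below is a minimal sketch: the function names are hypothetical, and it only parses text shaped like the output quoted in this issue; it does not assume which release first shipped the lstmbox config.

```python
import re


def parse_tesseract_version(output):
    """Extract the version string from text printed by `tesseract -v`."""
    match = re.search(r"tesseract\s+v?(\d+\.\d+\.\d+\S*)", output)
    return match.group(1) if match else None


def is_prerelease(version):
    """True for versions carrying an alpha/beta/rc tag, e.g. 4.0.0-beta.1."""
    return bool(re.search(r"alpha|beta|rc", version, re.IGNORECASE))
```

The text to parse can be captured with subprocess.run(["tesseract", "-v"], capture_output=True, text=True); reading both stdout and stderr is safer, since which stream carries the version may differ between builds.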