tesseract-ocr / tessdoc

Tesseract documentation
https://tesseract-ocr.github.io/tessdoc/

Can not generate .box file following the docs. #7

Closed by songzy12 4 years ago

songzy12 commented 4 years ago

Hi,

I am trying to train a new Tesseract model from my own images and labeled txt files. I have been following this part of the tesseract-4.0 doc: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#making-box-files

It says:

For example, tesseract image.png image lstmbox will generate a box file with name image.box for the image in the current directory.

However, the command failed because lstmbox could not be opened:


➜  data git:(master) tesseract -v
tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE
➜  data git:(master) tesseract 2020-03-14-02-41-41-captcha.png 2020-03-14-02-41-41-captcha lstmbox
read_params_file: Can't open lstmbox
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
➜  data git:(master) ✗ ls
2020-03-14-02-41-41-captcha.png  2020-03-14-02-43-05-captcha.png  2020-03-14-02-43-23-result.txt
2020-03-14-02-41-41-captcha.txt  2020-03-14-02-43-05-result.txt   2020-03-14-02-43-39-captcha.png
2020-03-14-02-41-41-result.txt   2020-03-14-02-43-23-captcha.png  2020-03-14-02-43-39-result.txt

On the other hand, I found the following script: https://github.com/tesseract-ocr/tesstrain/blob/master/generate_line_box.py

which works well when run like this:

python3 tesstrain/generate_line_box.py --image=data/2020-03-14-02-41-41-captcha.png --txt=data/2020-03-14-02-41-41-captcha.txt > data/2020-03-14-02-41-41-captcha.box
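
For readers unfamiliar with line-level box files, here is a minimal sketch of the kind of output such a script produces. It is an illustration, not the script's actual code: it assumes a format in which every character is given the bounding box of the whole image, followed by a tab entry that marks the end of the text line. The function name `line_boxes` is hypothetical.

```python
def line_boxes(text, width, height):
    """Return box-file rows for one line of text in a width x height image.

    Assumption: each row is "<char> <left> <bottom> <right> <top> <page>",
    and every character reuses the box of the full image.
    """
    rows = [f"{ch} 0 0 {width} {height} 0" for ch in text]
    # Assumed convention: a tab entry closes the text line.
    rows.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return rows


if __name__ == "__main__":
    # Hypothetical label and image size, for illustration only.
    print("\n".join(line_boxes("ab12", 120, 40)))
```

The sketch ignores combining characters, which a real tool would need to merge with their base character.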

Is the current documentation on training Tesseract perhaps outdated?

zdenop commented 4 years ago

4.0.0-beta.1 is an old (very old) version. Use a recent version of Tesseract.
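
A quick way to catch this before training is to parse the first line of the `tesseract -v` output and flag pre-release builds. The sketch below is only a rough check under the assumption that the version line looks like the one in the report above (`tesseract 4.0.0-beta.1`). The function names are hypothetical, and the check does not know which releases ship the lstmbox config.

```python
import re


def parse_tesseract_version(output):
    """Parse the first line of `tesseract -v` output.

    Returns (major, minor, patch, prerelease), where prerelease is None
    for a final release. Assumes a line such as "tesseract 4.0.0-beta.1".
    """
    first_line = output.strip().splitlines()[0]
    match = re.match(
        r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?", first_line
    )
    if match is None:
        raise ValueError(f"unrecognized version line: {first_line!r}")
    major, minor, patch, prerelease = match.groups()
    return int(major), int(minor), int(patch), prerelease


def is_prerelease(output):
    """True if the version line carries a suffix such as beta, alpha or rc."""
    return parse_tesseract_version(output)[3] is not None
```

To use it, capture the output of `tesseract -v` (for example with `subprocess.run(["tesseract", "-v"], capture_output=True, text=True)`) and pass the text to `is_prerelease`. Checking both stdout and stderr is a reasonable precaution, since the stream the version goes to may differ between builds.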