tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Bad box coordinates in boxfile string! #338

Open khashashin opened 1 year ago

khashashin commented 1 year ago

I have prepared the following ground truth files:

../tesstrain/data/Chechen-ground-truth
|-- 1.box
|-- 1.gt.txt
|-- 1.png
|-- 10.box
|-- 10.gt.txt
|-- 10.png
|-- 11.box
|-- 11.gt.txt
|-- 11.png
|-- 12.box
|-- 12.gt.txt
|-- 12.png

The box files use the WordStr format. Here, for example, is the content of 1.box:

WordStr 65 61 1556 254  0   #НЕКЪАШ А
    65 61 1556 254  0
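To make the layout of these lines easier to inspect, here is a minimal Python sketch that splits a box line into its fields. It assumes the usual Tesseract box convention (symbol, then left, bottom, right, top, page, with the origin at the bottom-left corner of the image) and, for WordStr lines, a trailing `#` followed by the line text. The names `BoxLine`, `parse_box_line` and `fits_image` are hypothetical helpers for illustration only; this is not Tesseract's own parser and may accept or reject lines that Tesseract treats differently.

```python
from typing import NamedTuple, Optional


class BoxLine(NamedTuple):
    symbol: str           # "WordStr", a single character, or a leading whitespace character
    left: int
    bottom: int
    right: int
    top: int
    page: int
    text: Optional[str]   # only set for WordStr lines


def parse_box_line(line: str) -> BoxLine:
    """Split one box-file line into its fields (illustrative, not Tesseract's parser)."""
    line = line.rstrip("\r\n")
    text = None
    if line.startswith("WordStr "):
        # WordStr lines carry the transcription after '#'.
        line, _, text = line.partition("#")
    if line[:1].isspace():
        # A line that starts with whitespace uses that character as its symbol.
        symbol, rest = line[0], line[1:]
    else:
        symbol, _, rest = line.partition(" ")
    fields = rest.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 numeric fields, got {len(fields)}: {line!r}")
    left, bottom, right, top, page = (int(f) for f in fields)
    return BoxLine(symbol, left, bottom, right, top, page, text)


def fits_image(box: BoxLine, width: int, height: int) -> bool:
    """Check that the box is non-empty and lies inside a width x height image."""
    return 0 <= box.left < box.right <= width and 0 <= box.bottom < box.top <= height
```

Running `parse_box_line` over each line of 1.box and `fits_image` against the pixel size of 1.png is one way to spot coordinates that fall outside the image before starting a training run.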

In the file 1.gt.txt I then have the corresponding text:

НЕКЪАШ А

And here is the image:

image

Running the command make training MODEL_NAME=Chechen START_MODEL=rus TESSDATA=../tesseract/tessdata gives me an error:

set -x; \
tesseract "data/Chechen-ground-truth/1.png" data/Chechen-ground-truth/1 --psm 13 lstm.train
+ tesseract data/Chechen-ground-truth/1.png data/Chechen-ground-truth/1 --psm 13 lstm.train
Bad box coordinates in boxfile string!  65 61 1556 254  0
No block overlapping textline: НЕКЪАШ А
Failed to read pages from data/Chechen-ground-truth/1.png
Error during processing.
make: *** [Makefile:258: data/Chechen-ground-truth/1.lstmf] Error 1

I'm using Tesseract version 5.3.0.

zdenop commented 1 year ago

Please have a look at https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip to see how to prepare custom data for training.

khashashin commented 1 year ago

@zdenop Thanks for your reply. This data does not include any box files at all, so how does Tesseract know which character is which?

zdenop commented 1 year ago

Did you try to follow the instructions on https://github.com/tesseract-ocr/tesstrain/? As far as I see there is no instruction about creating box files ;-)

khashashin commented 1 year ago

@zdenop Thanks. After I removed the *.box files from the ground truth folder, the training could start, but the first step of the training (the tesstrain script) was to create the box files, and the coordinates look weird to me. Here is an example of a box file that tesstrain generated for me:

Н 0 0 209 43 0
Е 0 0 209 43 0
К 0 0 209 43 0
Ъ 0 0 209 43 0
А 0 0 209 43 0
Ш 0 0 209 43 0
  0 0 209 43 0
А 0 0 209 43 0
     0 0 209 43 0
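The pattern in this output can be reproduced with a short Python sketch: every character of the transcription gets the same box, which spans the whole line image, followed by one closing line. This is an assumption drawn from the output above, not a copy of tesstrain's generator. The function name `whole_line_boxes` is hypothetical, the closing line is assumed to start with a tab character, and the sketch ignores details such as combining characters.

```python
def whole_line_boxes(text: str, width: int, height: int) -> list[str]:
    """Give every character the bounding box of the full line image.

    Assumes a single-line image of size width x height (in pixels) and the
    box field order: symbol left bottom right top page.
    """
    boxes = [f"{ch} 0 0 {width} {height} 0" for ch in text]
    # Assumed end-of-line marker: a tab as the symbol, same whole-line box.
    boxes.append(f"\t 0 0 {width} {height} 0")
    return boxes


if __name__ == "__main__":
    # Transcription and image size taken from the example in this thread.
    for row in whole_line_boxes("НЕКЪАШ А", 209, 43):
        print(row)
```

Under these assumptions the identical coordinates are expected: with line-level ground truth the boxes only tie the text to the whole line image and do not locate individual characters.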

This was generated for the following image: image

And I only put the files *.png and *.gt.txt in the Ground Truth folder, my 1.gt.txt content was:

НЕКЪАШ А
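Since the ground truth folder here holds only pairs of *.png and *.gt.txt files, a quick consistency check before training can be sketched in Python as follows. The helper name `images_without_transcription` is hypothetical, and the check assumes the naming scheme shown in this thread (1.png next to 1.gt.txt); it is not part of tesstrain.

```python
from pathlib import Path


def images_without_transcription(ground_truth_dir: str) -> list[str]:
    """Return the names of *.png files that have no matching *.gt.txt file."""
    root = Path(ground_truth_dir)
    missing = []
    for image in root.glob("*.png"):
        # 1.png is expected to sit next to 1.gt.txt.
        transcription = root / (image.stem + ".gt.txt")
        if not transcription.exists():
            missing.append(image.name)
    return sorted(missing)


if __name__ == "__main__":
    # Path taken from the directory listing earlier in this thread.
    print(images_without_transcription("data/Chechen-ground-truth"))
```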

I just wonder how this works and whether there is an article about the process. I have not found anything about version 5, which seems relatively new, right? There are a lot of tutorials and examples for version 4, but the process they describe is different.

P.S. The model created by the training was able to recognize characters it did not recognize before the training (I had previously just used the rus.traineddata model and trained it further).

zdenop commented 1 year ago

Did you read and follow https://github.com/tesseract-ocr/tesstrain? Where is it written that the first stage is to create box files?

khashashin commented 1 year ago

Did you read and follow https://github.com/tesseract-ocr/tesstrain?

yes

Where is it written that the first stage is to create box files?

@zdenop Nowhere. tesstrain created the *.box files itself as its first step, and this is not mentioned in tesstrain's README.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.