tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

train tesseract-ocr from images fails #206

Closed by sammar80 3 years ago

sammar80 commented 3 years ago

I am trying to train tesseract-ocr from images. I created both the .tiff images and the .gt.txt files, but training fails:

make training MODEL_NAME=custom_model START_MODEL=eng PSM=7 TESSDATA=C:/Program Files (x86)/Tesseract-OCR/tessdata GROUND_TRUTH_DIR=/data/custom_model/all-gt
FIND: Invalid switch
FIND: Invalid switch
The syntax of the command is incorrect.
make: *** [data/custom_model/unicharset] Error 1

kba commented 3 years ago

You're using Windows where the find command does something different than on Linux (it's more like grep).

Can you use tesstrain in WSL or VirtualBox or similar? IIRC @wrznr was testing WSL deployment.
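For illustration: the step that trips over the Windows find is the one that gathers all ground-truth lines into a single file, which on Linux is done with find, xargs cat, sort and uniq. Below is a minimal platform-independent sketch of that one step in Python. The helper name collect_ground_truth is hypothetical and not part of tesstrain, and its plain string sort may order lines differently from the locale-aware Unix sort.

```python
from pathlib import Path


def collect_ground_truth(gt_dir: str, out_file: str) -> int:
    """Write the sorted, de-duplicated lines of all *.gt.txt files to out_file.

    Hypothetical helper (not part of tesstrain). Returns the number of
    unique lines written.
    """
    lines = set()
    # Recursively pick up every ground-truth text file below gt_dir.
    for path in Path(gt_dir).rglob("*.gt.txt"):
        for line in path.read_text(encoding="utf-8").splitlines():
            lines.add(line)
    out = Path(out_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    # One unique line per output line, in plain code-point order.
    out.write_text("".join(line + "\n" for line in sorted(lines)), encoding="utf-8")
    return len(lines)
```

This only replaces the collection step; the rest of the Makefile still relies on Unix tools, so WSL or a Linux VM remains the simpler route.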

sammar80 commented 3 years ago

Thank you, I will try it on Linux.

sammar80 commented 3 years ago

@kba You're right, it worked on Linux, but now I am getting this error:

make training MODEL_NAME=custom START_MODEL=eng TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata
find data/custom-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/custom/all-gt"
combine_tessdata -u /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata  data/eng/custom
Extracting tessdata components from /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
Wrote data/eng/custom.lstm
Wrote data/eng/custom.lstm-punc-dawg
Wrote data/eng/custom.lstm-word-dawg
Wrote data/eng/custom.lstm-number-dawg
Wrote data/eng/custom.lstm-unicharset
Wrote data/eng/custom.lstm-recoder
Wrote data/eng/custom.version
Version string:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
unicharset_extractor --output_unicharset "data/custom/my.unicharset" --norm_mode 2 "data/custom/all-gt"
Bad box coordinates in boxfile string! eZl2MW5e3sL2G8yNE4h7YepU7B4D6P6nM2i8E3vxH4z5Wxf7b7sWP5H4u6h7c6U6LY6G3upIU85BWG9j7CJs7w6WCMke6a5bH2Y5f8V9K8dHdfYcVqY8a5n7WQfr4d7vL4x6kbs7S7XAz6F9s5m34p2f3s7A9Sziy2Q5Xz5w9pH3w3s8ZVNGA9Wji6a9R2W2S3q7rVM4H2h8K2g8eH4u6h24FifNv
Extracting unicharset from plain text file data/custom/all-gt
Wrote unicharset file data/custom/my.unicharset
merge_unicharsets data/eng/custom.lstm-unicharset data/custom/my.unicharset  "data/custom/unicharset"
Loaded unicharset of size 112 from file data/eng/custom.lstm-unicharset
Loaded unicharset of size 59 from file data/custom/my.unicharset
Wrote unicharset file data/custom/unicharset.
make: *** No rule to make target 'data/custom-ground-truth/tmp1optdwrt.lstmf', needed by 'data/custom/all-lstmf'.  Stop.

kba commented 3 years ago

make: *** No rule to make target 'data/custom-ground-truth/tmp1optdwrt.lstmf', needed by 'data/custom/all-lstmf'. Stop.

I guess there is an image tmp1optdwrt.tif but no corresponding tmp1optdwrt.gt.txt?
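A quick way to check this across the whole directory is sketched below. It assumes image and transcription share a base name (e.g. tmp1optdwrt.tif and tmp1optdwrt.gt.txt); the function name find_unpaired and the list of image suffixes are assumptions for illustration, not part of tesstrain.

```python
from pathlib import Path

# Assumption: adjust to whatever image types your training set uses.
IMAGE_SUFFIXES = (".tif", ".tiff", ".png")


def find_unpaired(gt_dir: str):
    """Return (images without a .gt.txt, .gt.txt files without an image).

    Hypothetical helper: both results are sorted lists of base names.
    """
    root = Path(gt_dir)
    entries = list(root.iterdir())
    # Base names of images, e.g. "tmp1optdwrt" for "tmp1optdwrt.tif".
    images = {p.stem for p in entries if p.suffix.lower() in IMAGE_SUFFIXES}
    # Base names of transcriptions, with the full ".gt.txt" ending removed.
    texts = {p.name[: -len(".gt.txt")] for p in entries if p.name.endswith(".gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Calling find_unpaired("data/custom-ground-truth") and getting two empty lists back would rule out a missing pair.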

sammar80 commented 3 years ago

I checked, and the tmp1optdwrt.gt.txt file is there.

kba commented 3 years ago

Sorry, I didn't read the log correctly.

Bad box coordinates in boxfile string! eZl2MW5e3sL2G8yNE4h7YepU7B4D6P6nM2i8E3vxH4z5Wxf7b7sWP5H4u6h7c6U6LY6G3upIU85BWG9j7CJs7w6WCMke6a5bH2Y5f8V9K8dHdfYcVqY8a5n7WQfr4d7vL4x6kbs7S7XAz6F9s5m34p2f3s7A9Sziy2Q5Xz5w9pH3w3s8ZVNGA9Wji6a9R2W2S3q7rVM4H2h8K2g8eH4u6h24FifNv

This is the problem. Can you post the image, .gt.txt and, if available (which it likely is not), the .box file? I guess that such a long word confuses tesseract. As a workaround, can you remove this particular image/text pair from the training set, to make sure that it's indeed this particular pair that causes the issue?
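To find other candidates of the same kind before retraining, a rough sketch like the following can list ground-truth files containing an unusually long unbroken token. The helper name find_long_tokens and the threshold of 40 characters are arbitrary assumptions, not a limit documented by Tesseract.

```python
from pathlib import Path


def find_long_tokens(gt_dir: str, max_len: int = 40):
    """Return (file name, longest token length) for suspicious .gt.txt files.

    Hypothetical helper: max_len is an arbitrary threshold, not a
    documented Tesseract limit.
    """
    hits = []
    for path in sorted(Path(gt_dir).glob("*.gt.txt")):
        # Split on whitespace so each "word" is measured separately.
        tokens = path.read_text(encoding="utf-8").split()
        longest = max((len(token) for token in tokens), default=0)
        if longest > max_len:
            hits.append((path.name, longest))
    return hits
```

Any file it reports can be moved out of the training directory together with its image to test whether the error disappears.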

sammar80 commented 3 years ago

Thank you, I did what you recommended and the image processing worked, but now I get this error:

!intmode:Error:Assert failed:in file weightmatrix.cpp, line 244
Makefile:266: recipe for target 'data/custom/checkpoints/custom_checkpoint' failed
make: *** [data/custom/checkpoints/custom_checkpoint] Segmentation fault (core dumped)

kba commented 3 years ago

Is eng from tessdata_fast? Those models are optimized for runtime speed but do not allow fine-tuning. Try the non-optimized model from https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata.

sammar80 commented 3 years ago

I switched to the models from tessdata_best, and the training finished successfully. I just have one question: can a model generated in a Linux environment not be used on Windows? I am now getting (3221225477, '') when I add the new model to tesseract-ocr on Windows and try to run my program.

kba commented 3 years ago

I am getting now (3221225477, '') when I added the new model on windows tesseract-ocr and try to run my program

I don't understand. What do you mean by getting (3221225477, '')? The models should be platform-independent, AFAIK.

sammar80 commented 3 years ago

I am trying to use the new model, and I am getting this error: (3221225477, '')

kba commented 3 years ago

I see, from what I understand, that's the base-10 form of 0xc0000005 which is an access violation in Windows, i.e. a segfault. You're probably best off asking on the tesseract forum about that, since that's an issue with tesseract itself.
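The conversion can be reproduced in a few lines of Python. This is only a sketch for decoding the number; the function names are made up for illustration, and the masking step is there because some tools report the same code as a negative 32-bit value.

```python
ACCESS_VIOLATION = 0xC0000005  # Windows status code for an access violation


def describe_exit_code(code: int) -> str:
    """Format an exit code as an unsigned 32-bit hexadecimal string."""
    # Masking maps a negative (signed 32-bit) value to its unsigned form.
    unsigned = code & 0xFFFFFFFF
    return f"0x{unsigned:08X}"


def is_access_violation(code: int) -> bool:
    """True if the exit code is the Windows access-violation status."""
    return (code & 0xFFFFFFFF) == ACCESS_VIOLATION


print(describe_exit_code(3221225477))
```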

Shreeshrii commented 3 years ago

@sammar80 Did your model work on Linux? Does it work from command line?
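One way to test this outside your own program is to call the tesseract executable directly and look at its exit code and error output. The sketch below builds such a call from Python. The image name, model name and tessdata directory are placeholders you would replace, and --psm 7 is only assumed because PSM=7 was used in the original training command.

```python
import shutil
import subprocess


def build_tesseract_cmd(image: str, lang: str, tessdata_dir: str, psm: int = 7):
    """Build a tesseract command line that prints the recognized text to stdout."""
    return [
        "tesseract", image, "stdout",
        "--tessdata-dir", tessdata_dir,  # directory containing <lang>.traineddata
        "-l", lang,
        "--psm", str(psm),
    ]


def run_tesseract(image: str, lang: str, tessdata_dir: str, psm: int = 7):
    """Run tesseract and return (exit code, stdout, stderr), or None if not installed."""
    if shutil.which("tesseract") is None:
        return None
    proc = subprocess.run(
        build_tesseract_cmd(image, lang, tessdata_dir, psm),
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

If the same call succeeds on Linux but crashes on Windows with the same model file, that points at the Windows tesseract build rather than at the training.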

sammar80 commented 3 years ago

@Shreeshrii I will try and let you know when it works