tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Can't open lstm.train despite (probably) having all training tools #366

Open Forthoney opened 9 months ago

Forthoney commented 9 months ago

I'm trying to train a Tesseract model on a university shared computing cluster and am running into two odd issues. I think I solved one of them, but I cannot figure out the other.

The first issue is with TESSDATA_PREFIX. When I ran make training, I got the following error:

tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
Error opening data file /oscar/rt/9.2/software/0.20-generic/0.20.1/opt/spack/linux-rhel9-x86_64_v3/gcc-11.3.1/tesseract-5.3.3-vq3alttswvcbt32g6ciju6qewc56rvby/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
make: *** [Makefile:278: data/foo-ground-truth/alexis_ruhe01_1852_0018_022.lstmf] Error 1

After inspecting the tessdata directory, I found that it indeed contains no .traineddata files. I cannot install software on the shared cluster myself and would have to go through the cluster managers, so instead I cloned the tessdata directory and set TESSDATA_PREFIX to the path of the clone. With this change, the error went away. I think this is a valid way to resolve the problem, but it is not the canonical setup, so I am logging it here in case it is the underlying cause.
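The workaround above can be sketched roughly as follows. The clone location `$HOME/tessdata` and the helper `has_traineddata` are hypothetical names used only for illustration; the only detail taken from the error message is that Tesseract looks for `eng.traineddata` in the directory named by TESSDATA_PREFIX.

```shell
# Hypothetical location of the cloned tessdata directory; adjust as needed.
TESSDATA_DIR="${TESSDATA_DIR:-$HOME/tessdata}"

# Point Tesseract at the clone instead of the installation's empty tessdata.
export TESSDATA_PREFIX="$TESSDATA_DIR"

# Succeeds only if the directory ($1) holds a non-empty model for the language ($2).
has_traineddata() {
  [ -s "$1/$2.traineddata" ]
}

if has_traineddata "$TESSDATA_PREFIX" eng; then
  echo "eng.traineddata found in $TESSDATA_PREFIX"
else
  echo "eng.traineddata missing from $TESSDATA_PREFIX"
fi
```

Running this in the same shell before `make training` should make the variable visible to the `tesseract` calls that make starts.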

The second issue is almost exactly the same as this issue. With the workaround above applied, I got the following log:

unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
Extracting unicharset from plain text file data/foo/all-gt
Other case I of i is not in unicharset
Other case Ä of ä is not in unicharset
Other case Ö of ö is not in unicharset
Other case Y of y is not in unicharset
Wrote unicharset file data/foo/unicharset
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
read_params_file: Can't open lstm.train
... (repeats with different files)
python3 shuffle.py 0 "data/foo/all-lstmf"
+ head -n 326 data/foo/all-lstmf
+ tail -n 37 data/foo/all-lstmf
+ '[' '' = Windows_NT ']'
make: Warning: File 'data/foo/list.eval' has modification time 0.00096 s in the future
if [ "" = "Windows_NT" ]; then \
        dos2unix "data/foo/foo.numbers"; \
        dos2unix "data/foo/foo.punc"; \
        dos2unix "data/foo/foo.wordlist"; \
        dos2unix "data/langdata/foo/foo.config"; \
fi
combine_lang_model \
  --input_unicharset data/foo/unicharset \
  --script_dir data/langdata \
  --numbers data/foo/foo.numbers \
  --puncs data/foo/foo.punc \
  --words data/foo/foo.wordlist \
  --output_dir data \
   \
  --lang foo
Failed to read data from: data/foo/foo.wordlist
Failed to read data from: data/foo/foo.punc
Failed to read data from: data/foo/foo.numbers
Loaded unicharset of size 77 from file data/foo/unicharset
Setting unichar properties
Other case I of i is not in unicharset
Other case Ä of ä is not in unicharset
Other case Ö of ö is not in unicharset
Other case Y of y is not in unicharset
Setting script properties
Failed to load script unicharset from:data/langdata/Inherited.unicharset
Config file is optional, continuing...
Failed to read data from: data/langdata/foo/foo.config
Null char=2
Created data/foo/foo.traineddata
lstmtraining \
  --debug_interval 0 \
  --traineddata data/foo/foo.traineddata \
  --learning_rate 0.002 \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c`head -n1 data/foo/unicharset`]" \
  --model_output data/foo/checkpoints/foo \
  --train_listfile data/foo/list.train \
  --eval_listfile data/foo/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01
Warning: given outputs 77 not equal to unicharset of 76.
Num outputs,weights in Series:
  1,36,0,1:1, 0
Num outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys48:48, 12480
  Lfx96:96, 55680
  RxLrx96:96, 74112
  Lfx192:192, 221952
  Fc76:76, 14668
Total weights = 379052
Built network:[1,36,0,1[C3,3Ft16]Mp3,3TxyLfys48Lfx96RxLrx96Lfx192Fc76] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c77]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=75
Deserialize header failed: data/foo-ground-truth/frapan_bittersuess_1891_0103_007.lstmf
Deserialize header failed: data/foo-ground-truth/clauren_liebe_1827_0105_016.lstmf
Deserialize header failed: data/foo-ground-truth/hoffmann_elixiere01_1815_0173_012.lstmf
Deserialize header failed: data/foo-ground-truth/andreas_fenitschka_1898_0066_007.lstmf
Deserialize header failed: data/foo-ground-truth/lenau_gedichte_1832_0225_006.lstmf
Deserialize header failed: data/foo-ground-truth/poersch_gewerkschaftsbewegung_1897_0032_045.lstmf
Deserialize header failed: data/foo-ground-truth/saar_novellen_1877_0283_020.lstmf
Deserialize header failed: data/foo-ground-truth/fiedler_kuenstlerische_1887_0135_015.lstmf
Deserialize header failed: data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.lstmf
Load of page 0 failed!
Load of images failed!!
make: *** [Makefile:347: data/foo/checkpoints/foo_checkpoint] Error 1

I cannot quite figure out why I am getting these errors. I believe all the training tools are installed, because the bin folder under the tesseract installation contains all the expected ones: ambiguous_words, combine_tessdata, merge_unicharsets, tesseract, classifier_tester, dawg2wordlist, mftraining, text2image, cntraining, lstmeval, set_unicharset_properties, unicharset_extractor, combine_lang_model, lstmtraining, shapeclustering, wordlist2dawg.
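A quick way to double-check both points is sketched below. `check_tools` is a hypothetical helper, and the lookup of `lstm.train` under `$TESSDATA_PREFIX/configs/` is an assumption about where Tesseract resolves config names, not something the log confirms.

```shell
# Report which of the given tools are on PATH; non-zero status if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok:      $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Tools that appear in the training log above.
check_tools tesseract unicharset_extractor combine_lang_model lstmtraining \
  || echo "at least one tool is missing from PATH"

# Assumption: a config name such as lstm.train is looked up under
# "$TESSDATA_PREFIX/configs/".
if [ -f "${TESSDATA_PREFIX:-}/configs/lstm.train" ]; then
  echo "lstm.train found"
else
  echo "lstm.train not found under ${TESSDATA_PREFIX:-<unset>}/configs"
fi
```

If the tools are all present but `lstm.train` is not, the problem may lie in the tessdata directory rather than in the installed binaries.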

Do you believe this to be an issue with the tesseract installation or a different issue?

zdenop commented 9 months ago
  1. Please do not try to train Tesseract unless you are familiar with Tesseract.
  2. Do not try to train Tesseract on the cloud or some external service unless you can train Tesseract on a local machine (e.g. you can replicate the problem and find the solution on a local machine).
  3. Make sure you can use/train ocrd-testset.zip without error.
  4. The current make training process does not correctly handle errors or missing dependencies (e.g. bc), so you always need to log the whole training process to find the cause (there is an intention to rewrite it, but we lack the resources (time)). Try checking for and deleting zero-length files.
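The last point could look roughly like this in practice. `zero_length_files` is a hypothetical helper and `training.log` an arbitrary file name; the directory is the one from the log above.

```shell
# List zero-length files below a directory ($1); pass "delete" as the second
# argument to remove them as well.
zero_length_files() {
  dir="$1"
  if [ "${2:-}" = "delete" ]; then
    find "$dir" -type f -size 0 -print -exec rm -f {} +
  else
    find "$dir" -type f -size 0 -print
  fi
}

# Example usage (commented out so that nothing runs by accident):
#   command -v bc || echo "bc is not installed"
#   make training 2>&1 | tee training.log
#   zero_length_files data/foo-ground-truth
#   zero_length_files data/foo-ground-truth delete
```

Listing before deleting shows which .lstmf or .box files came out empty, which can then be matched against the first error in the saved log.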