Can't open lstm.train despite (probably) having all training tools

I'm trying to train a Tesseract model on a university shared computing cluster and am running into two odd issues. I think I solved one of them, but I cannot figure out the other.

The first issue involves TESSDATA_PREFIX. When I ran make training, I got the following error:

tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
Error opening data file /oscar/rt/9.2/software/0.20-generic/0.20.1/opt/spack/linux-rhel9-x86_64_v3/gcc-11.3.1/tesseract-5.3.3-vq3alttswvcbt32g6ciju6qewc56rvby/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
make: *** [Makefile:278: data/foo-ground-truth/alexis_ruhe01_1852_0018_022.lstmf] Error 1

After inspecting the tessdata directory, I found that it indeed does not contain any .traineddata files. Unfortunately, I cannot install software directly on the shared cluster and must instead go through the cluster managers. As a workaround, I cloned the tessdata repository and set TESSDATA_PREFIX to the path of the clone. With this change, the error went away. I think this resolves the problem correctly, but it is not the canonical setup, so I am noting it here in case it turns out to be the underlying cause.
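For reference, the workaround amounts to roughly the following sketch. The `$HOME/tessdata` location and the `check_tessdata` helper are assumptions for illustration; the helper is not part of tesseract or tesstrain.

```shell
# Sketch: point tesseract at a user-owned tessdata directory and verify it.
# check_tessdata is a hypothetical helper, not a tesseract or tesstrain tool.
check_tessdata() {
  dir="$1"    # candidate tessdata directory
  lang="$2"   # language code, e.g. eng
  if [ -f "$dir/$lang.traineddata" ]; then
    echo "ok: $dir/$lang.traineddata"
  else
    echo "missing: $dir/$lang.traineddata"
    return 1
  fi
}

# Assumed location of the cloned copy; adjust to wherever the clone lives.
export TESSDATA_PREFIX="${TESSDATA_PREFIX:-$HOME/tessdata}"

# Report whether the model named in the error message is now reachable.
check_tessdata "$TESSDATA_PREFIX" eng || echo "eng model not found under TESSDATA_PREFIX"
```

Exporting the variable in the same shell session (or job script) that runs make training matters, since make passes the environment on to tesseract.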

The second issue is almost exactly the same as this issue. With the workaround above applied, the log reads:

unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
Extracting unicharset from plain text file data/foo/all-gt
Other case I of i is not in unicharset
Other case Ä of ä is not in unicharset
Other case Ö of ö is not in unicharset
Other case Y of y is not in unicharset
Wrote unicharset file data/foo/unicharset
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
read_params_file: Can't open lstm.train
... (repeats with different files)
python3 shuffle.py 0 "data/foo/all-lstmf"
+ head -n 326 data/foo/all-lstmf
+ tail -n 37 data/foo/all-lstmf
+ '[' '' = Windows_NT ']'
make: Warning: File 'data/foo/list.eval' has modification time 0.00096 s in the future
if [ "" = "Windows_NT" ]; then \
        dos2unix "data/foo/foo.numbers"; \
        dos2unix "data/foo/foo.punc"; \
        dos2unix "data/foo/foo.wordlist"; \
        dos2unix "data/langdata/foo/foo.config"; \
fi
combine_lang_model \
  --input_unicharset data/foo/unicharset \
  --script_dir data/langdata \
  --numbers data/foo/foo.numbers \
  --puncs data/foo/foo.punc \
  --words data/foo/foo.wordlist \
  --output_dir data \
   \
  --lang foo
Failed to read data from: data/foo/foo.wordlist
Failed to read data from: data/foo/foo.punc
Failed to read data from: data/foo/foo.numbers
Loaded unicharset of size 77 from file data/foo/unicharset
Setting unichar properties
Other case I of i is not in unicharset
Other case Ä of ä is not in unicharset
Other case Ö of ö is not in unicharset
Other case Y of y is not in unicharset
Setting script properties
Failed to load script unicharset from:data/langdata/Inherited.unicharset
Config file is optional, continuing...
Failed to read data from: data/langdata/foo/foo.config
Null char=2
Created data/foo/foo.traineddata
lstmtraining \
  --debug_interval 0 \
  --traineddata data/foo/foo.traineddata \
  --learning_rate 0.002 \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c`head -n1 data/foo/unicharset`]" \
  --model_output data/foo/checkpoints/foo \
  --train_listfile data/foo/list.train \
  --eval_listfile data/foo/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01
Warning: given outputs 77 not equal to unicharset of 76.
Num outputs,weights in Series:
  1,36,0,1:1, 0
Num outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys48:48, 12480
  Lfx96:96, 55680
  RxLrx96:96, 74112
  Lfx192:192, 221952
  Fc76:76, 14668
Total weights = 379052
Built network:[1,36,0,1[C3,3Ft16]Mp3,3TxyLfys48Lfx96RxLrx96Lfx192Fc76] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c77]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=75
Deserialize header failed: data/foo-ground-truth/frapan_bittersuess_1891_0103_007.lstmf
Deserialize header failed: data/foo-ground-truth/clauren_liebe_1827_0105_016.lstmf
Deserialize header failed: data/foo-ground-truth/hoffmann_elixiere01_1815_0173_012.lstmf
Deserialize header failed: data/foo-ground-truth/andreas_fenitschka_1898_0066_007.lstmf
Deserialize header failed: data/foo-ground-truth/lenau_gedichte_1832_0225_006.lstmf
Deserialize header failed: data/foo-ground-truth/poersch_gewerkschaftsbewegung_1897_0032_045.lstmf
Deserialize header failed: data/foo-ground-truth/saar_novellen_1877_0283_020.lstmf
Deserialize header failed: data/foo-ground-truth/fiedler_kuenstlerische_1887_0135_015.lstmf
Deserialize header failed: data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.lstmf
Load of page 0 failed!
Load of images failed!!
make: *** [Makefile:347: data/foo/checkpoints/foo_checkpoint] Error 1

I cannot quite figure out why I am getting these errors. I believe all the training tools are installed, since the bin folder under the Tesseract installation contains all the expected ones: ambiguous_words, combine_tessdata, merge_unicharsets, tesseract, classifier_tester, dawg2wordlist, mftraining, text2image, cntraining, lstmeval, set_unicharset_properties, unicharset_extractor, combine_lang_model, lstmtraining, shapeclustering, wordlist2dawg.
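One thing that may be worth checking, as far as I understand how tesseract resolves config names: a bare name such as lstm.train on the command line is looked up under configs/ inside the tessdata directory, so a tessdata copy without that file could explain the "Can't open lstm.train" message. The sketch below checks for that file and lists any entries in data/foo/all-lstmf that are missing or empty, which might account for the "Deserialize header failed" lines. Both helper names are hypothetical, and neither check is confirmed to be the cause.

```shell
# Sketch: two diagnostic checks suggested by the log above.
# Both helpers are hypothetical; they are not tesseract or tesstrain tools.

# 1. Is the lstm.train config present under <tessdata>/configs/?
has_lstm_train_config() {
  [ -f "$1/configs/lstm.train" ]
}

# 2. Print every path in a list file that is missing or has zero size.
list_unusable_lstmf() {
  while IFS= read -r f; do
    [ -s "$f" ] || echo "$f"
  done < "$1"
}

# Assumption: TESSDATA_PREFIX points at the tessdata directory in use.
if has_lstm_train_config "${TESSDATA_PREFIX:-.}"; then
  echo "lstm.train config found"
else
  echo "lstm.train config not found under ${TESSDATA_PREFIX:-.}/configs"
fi

# Only inspect the list if the training run has already produced it.
if [ -f data/foo/all-lstmf ]; then
  list_unusable_lstmf data/foo/all-lstmf
fi
```

If the config turns out to be missing, the .lstmf files were probably never written correctly, so they would need to be regenerated after fixing the tessdata directory.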

Do you believe this to be an issue with the tesseract installation or a different issue?

tesseract-ocr / tesstrain #366