tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Error while training on sample data. #325

Closed: tevzselcan closed this issue 1 year ago

tevzselcan commented 1 year ago

I tried to train with the sample data provided here (https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip). I extracted the contents to data/foo-ground-truth and ran make training but got this error:

lstmtraining \
  --debug_interval 0 \
  --traineddata data/foo/foo.traineddata \
  --learning_rate 0.002 \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 O1c`head -n1 data/foo/unicharset`]" \
  --model_output data/foo/checkpoints/foo \
  --train_listfile data/foo/list.train \
  --eval_listfile data/foo/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01
Failed to load list of training filenames from data/foo/list.train
make: *** [Makefile:326: data/foo/checkpoints/foo_checkpoint] Error 1

I'm running this on Ubuntu 20.04.

zdenop commented 1 year ago

... which indicates that you ignored earlier errors (=> list.train was not created). IMO the Makefile should stop as soon as an earlier step fails...
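One way to spot the earlier error is to save the full output of the run and look for the first line that resembles a failure, rather than reading only the last message. The helper below is a minimal sketch: the function name first_error and the pattern list are assumptions chosen for illustration, not part of tesstrain, and the heuristic can miss errors or match harmless lines.

```shell
# Print the first line of a saved build log that looks like a failure,
# prefixed with its line number. Returns non-zero if nothing matches.
# The pattern list is a heuristic assumption, not something tesstrain defines.
first_error() {
  grep -n -m 1 -i -E 'command not found|no such file|invalid|failed|error' -- "$1"
}
```

A possible use, with training.log as a hypothetical file name: run make training 2>&1 | tee training.log, then first_error training.log.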

tevzselcan commented 1 year ago

OK, so I redid everything and got this output when running make training once: https://pastebin.com/0uZGxPEB. I still get an error at the end:

python3 shuffle.py 0 "data/foo/all-lstmf"
/bin/bash: line 1: bc: command not found
/bin/bash: line 4: bc: command not found
+ head -n '' data/foo/all-lstmf
head: invalid number of lines: ''
+ tail -n '' data/foo/all-lstmf
tail: invalid number of lines: ''
make: *** [Makefile:191: data/foo/list.train] Error 1

zdenop commented 1 year ago

Please install bc (basic calculator). I will add it to Readme.md.
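The log above shows the list.train step calling bc; with bc missing, the line count passed to head and tail comes back empty, which produces the "invalid number of lines" messages. A small pre-flight check can catch this before training starts. This is a sketch only: the function name missing_tools is made up for illustration, and the tool list is taken from the commands visible in this thread, so it may not cover everything the Makefile needs.

```shell
# Print each required command that is not on PATH, one per line.
# Returns non-zero if at least one is missing.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Tools seen in the logs of this thread; the list is an assumption, not exhaustive.
# On Ubuntu, bc can be installed with: sudo apt-get install bc
if ! missing_tools bc python3 lstmtraining; then
  echo "Install the tools listed above before running make training." >&2
fi
```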

tevzselcan commented 1 year ago

Thank you! That worked. Just two quick questions, though: would it be possible to train Tesseract to recognize symbols such as Ω and α, which appear quite often in physics? And could Tesseract be trained using only letters, so that the transcriptions would be like A a B b C d...?

zdenop commented 1 year ago

See the part "adding the plus-minus sign (±) to the existing English model". Even though it is not mentioned in the Tesseract 5 training docs, the process described there should work.
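Whichever training route is used, the new symbols have to occur in the ground-truth transcriptions. The sketch below counts how many transcription files contain a given symbol. It assumes the transcriptions sit in data/foo-ground-truth, as in the first post, and use the .gt.txt suffix; the function name is hypothetical, and the count says nothing about whether there are enough samples for training to succeed.

```shell
# Count ground-truth transcription files (*.gt.txt) that contain a symbol.
# $1 = symbol (matched as a literal string), $2 = ground-truth directory.
count_files_with_symbol() {
  symbol=$1
  dir=$2
  # -l lists each matching file once, -F disables regex interpretation
  grep -lF -- "$symbol" "$dir"/*.gt.txt 2>/dev/null | wc -l | tr -d ' '
}

# Directory name taken from the first post; adjust to your own model name.
for s in Ω α; do
  echo "$s: $(count_files_with_symbol "$s" data/foo-ground-truth) file(s)"
done
```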