Opened by inductiveload 3 years ago (status: Open)
I cannot reproduce the problem. Could you please run a test without a start model, i.e. train from scratch?
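A minimal sketch of the difference between the two invocations, assuming the tesstrain Makefile variables MODEL_NAME and START_MODEL used elsewhere in this thread. The helper name training_cmd is hypothetical and only prints the command rather than running it:

```shell
# Hypothetical helper: print the tesstrain command for a given model name.
# With no second argument, START_MODEL is omitted, which means training from scratch.
training_cmd() {
    model_name="$1"
    start_model="${2:-}"
    cmd="make training MODEL_NAME=$model_name"
    if [ -n "$start_model" ]; then
        # Fine-tuning: continue from an existing model such as eng.
        cmd="$cmd START_MODEL=$start_model"
    fi
    printf '%s\n' "$cmd"
}

training_cmd foo        # training from scratch
training_cmd foo eng    # fine-tuning from the eng start model
```

If the first command trains successfully and only the second one crashes, the start model is the likely culprit rather than the ground-truth data.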
Hi there, my problem is quite similar. Training without START_MODEL works, but as soon as I add a start model I get a segmentation fault:
lstmtraining \
--debug_interval 0 \
--traineddata data/pdf/pdf.traineddata \
--old_traineddata /usr/share/tesseract-ocr/4.00/tessdata//eng.traineddata \
--continue_from data/eng/pdf.lstm \
--learning_rate 0.0001 \
--model_output data/pdf/checkpoints/pdf \
--train_listfile data/pdf/list.train \
--eval_listfile data/pdf/list.eval \
--max_iterations 10000 \
--target_error_rate 0.01
Loaded file data/eng/pdf.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 111!
Num (Extended) outputs,weights in Series:
1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx192:192, 221952
Fc111:111, 0
Total weights = 364384
Previous null char=110 mapped to 110
Continuing from data/eng/pdf.lstm
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00efa1bb61fb5e2acbac526cae15db47_22.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00ed3d1c5efa45cb1f159b2aea364c06_13.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00e9abbf6ae0316b26564489043309e7_28.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00d93774feb260161c699826659335eb_26.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00d93774feb260161c699826659335eb_31.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00db3bce204043f8ae6093acb10f3421_15.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00d93774feb260161c699826659335eb_24.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00cf516d14934c8cc4aced3892e8023d_9.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/00d9bc8920fad718d800d8e03e5db4a1_26.lstmf
Loaded 1/1 lines (1-1) of document data/pdf-ground-truth/0a0a3b164fb469e52d9532de17a0ca6d_15.lstmf
make: *** [Makefile:278: data/pdf/checkpoints/pdf_checkpoint] Segmentation fault
I use tesseract 4.1.1:
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
installed on Ubuntu 20.04.3 with apt install tesseract-ocr tesseract-ocr-eng
and the command:
make training MODEL_NAME='pdf' START_MODEL='eng' CORES=8 PSM=6 TESSDATA='/usr/share/tesseract-ocr/4.00/tessdata/'
Same problem with the test set 'foo':
lstmtraining \
--debug_interval 0 \
--traineddata data/foo/foo.traineddata \
--old_traineddata /usr/share/tesseract-ocr/4.00/tessdata//eng.traineddata \
--continue_from data/eng/foo.lstm \
--learning_rate 0.0001 \
--model_output data/foo/checkpoints/foo \
--train_listfile data/foo/list.train \
--eval_listfile data/foo/list.eval \
--max_iterations 10000 \
--target_error_rate 0.01
Loaded file data/eng/foo.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 119!
Num (Extended) outputs,weights in Series:
1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx192:192, 221952
Fc119:119, 0
Total weights = 364384
Previous null char=110 mapped to 118
Continuing from data/eng/foo.lstm
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/frapan_bittersuess_1891_0103_007.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/clauren_liebe_1827_0105_016.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/lenau_gedichte_1832_0225_006.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/hoffmann_elixiere01_1815_0173_012.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/andreas_fenitschka_1898_0066_007.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/poersch_gewerkschaftsbewegung_1897_0032_045.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/saar_novellen_1877_0283_020.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/raschdorff_hochbau_1880_0025_016.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/gutzkow_wally_1835_0154_008.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/fiedler_kuenstlerische_1887_0135_015.lstmf
Loaded 1/1 lines (1-1) of document data/foo-ground-truth/poersch_gewerkschaftsbewegung_1897_0020_021.lstmf
make: *** [Makefile:278: data/foo/checkpoints/foo_checkpoint] Segmentation fault
Command:
make training START_MODEL='eng' CORE=8 TESSDATA='/usr/share/tesseract-ocr/4.00/tessdata/'
I had the same problem when trying to train with the system-provided start model. After reading https://github.com/tesseract-ocr/tesseract/issues/1573, I downloaded the corresponding tessdata_best model and everything worked fine.
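A rough sketch of that workaround, assuming the model files sit at the top level of the tessdata_best repository on its main branch. The helper names best_model_url and fetch_best_model and the default directory ~/tessdata_best are made up for illustration:

```shell
# Hypothetical helper: build the download URL for a tessdata_best model.
# Assumes <lang>.traineddata lives at the top level of the repo's main branch.
best_model_url() {
    printf '%s\n' "https://github.com/tesseract-ocr/tessdata_best/raw/main/$1.traineddata"
}

# Hypothetical helper: download the model into a local tessdata directory.
fetch_best_model() {
    lang="$1"                           # e.g. eng
    dest="${2:-$HOME/tessdata_best}"    # assumed target directory
    mkdir -p "$dest" || return 1
    wget -q -O "$dest/$lang.traineddata" "$(best_model_url "$lang")"
}

# Usage (not executed here):
#   fetch_best_model eng ~/tessdata_best
#   make training MODEL_NAME=foo START_MODEL=eng TESSDATA=~/tessdata_best
```

Pointing TESSDATA at this directory instead of /usr/share/tesseract-ocr/4.00/tessdata/ makes the training run pick up the tessdata_best model rather than the one installed by the distribution package.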
Arch Linux,
What I did:
Placed tessdata_best in ~/src
unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd START_MODEL=frk TESSDATA=~/src/tessdata_best MAX_ITERATIONS=10000
Output:
GDB of crashed lstmtraining: