paalberti / tesseract-dan-fraktur

Tesseract ocr training data for Danish written in fraktur script and a few other languages

dan_frak/buildscript.sh fails with tesseract 4 alpha #3

Open tokee opened 7 years ago

tokee commented 7 years ago

I would very much like to try out the new tesseract 4 alpha LSTM with fraktur, but cannot find any trained fraktur models anywhere. So I tried running buildscript.sh in dan_frak, but got a lot of errors and a 691 byte dan_frak.traineddata. Same story with deu_frak.

dan_frak.embedsiver.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.embedsiver2.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.embedsiver3.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.enteneller.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font1.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font10.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font11.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font12.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font13.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font14.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Detected 23 diacritics
dan_frak.font15.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font16.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font17.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Detected 14 diacritics
dan_frak.font18.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font19.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font2.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font3.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font4.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.font5.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.fontfile_2.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.fontfile_3.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Detected 56 diacritics
dan_frak.fontfile_4.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.fontfile_5.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.fontfile_6.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.fontfile_7.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.fontfile_8.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.nyeeventyr.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.p1020096.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
dan_frak.paakierkegaardske.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
dan_frak.poulmmollerefterladteskrifter.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 40311344 dpi. Using 70 instead.
dan_frak.tilselvprøvelse.exp1.tif:
read_params_file: Can't open nobatch
read_params_file: Can't open box.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 12755968 dpi. Using 70 instead.
Extracting unicharset from dan_frak.embedsiver.exp1.box
Extracting unicharset from dan_frak.embedsiver2.exp1.box
Extracting unicharset from dan_frak.embedsiver3.exp1.box
Extracting unicharset from dan_frak.enteneller.exp1.box
Extracting unicharset from dan_frak.font1.exp1.box
Extracting unicharset from dan_frak.font10.exp1.box
Extracting unicharset from dan_frak.font11.exp1.box
Extracting unicharset from dan_frak.font12.exp1.box
Extracting unicharset from dan_frak.font13.exp1.box
Extracting unicharset from dan_frak.font14.exp1.box
Extracting unicharset from dan_frak.font15.exp1.box
Extracting unicharset from dan_frak.font16.exp1.box
Extracting unicharset from dan_frak.font17.exp1.box
Extracting unicharset from dan_frak.font18.exp1.box
Extracting unicharset from dan_frak.font19.exp1.box
Extracting unicharset from dan_frak.font2.exp1.box
Extracting unicharset from dan_frak.font3.exp1.box
Extracting unicharset from dan_frak.font4.exp1.box
Extracting unicharset from dan_frak.font5.exp1.box
Extracting unicharset from dan_frak.fontfile_2.exp1.box
Extracting unicharset from dan_frak.fontfile_3.exp1.box
Extracting unicharset from dan_frak.fontfile_4.exp1.box
Extracting unicharset from dan_frak.fontfile_5.exp1.box
Extracting unicharset from dan_frak.fontfile_6.exp1.box
Extracting unicharset from dan_frak.fontfile_7.exp1.box
Extracting unicharset from dan_frak.fontfile_8.exp1.box
Extracting unicharset from dan_frak.nyeeventyr.exp1.box
Extracting unicharset from dan_frak.p1020096.exp1.box
Extracting unicharset from dan_frak.paakierkegaardske.exp1.box
Extracting unicharset from dan_frak.poulmmollerefterladteskrifter.exp1.box
Extracting unicharset from dan_frak.tilselvprøvelse.exp1.box
Wrote unicharset file ./unicharset.
Reading *.tr ...

Error: Unable to open *.tr!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
Segmentation fault (core dumped)
Warning: No shape table file present: shapetable
Reading *.tr ...

Error: Unable to open *.tr!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
Segmentation fault (core dumped)
Reading *.tr ...

Error: Unable to open *.tr!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
Segmentation fault (core dumped)
mv: cannot stat 'inttemp': No such file or directory
mv: cannot stat 'normproto': No such file or directory
mv: cannot stat 'pffmtable': No such file or directory
mv: cannot stat 'shapetable': No such file or directory
Loading unicharset from 'dan_frak.unicharset'
Failed to load unicharset from 'dan_frak.unicharset'
Loading unicharset from 'dan_frak.unicharset'
Failed to load unicharset from 'dan_frak.unicharset'
Loading unicharset from 'dan_frak.unicharset'
Failed to load unicharset from 'dan_frak.unicharset'
Loading unicharset from 'dan_frak.unicharset'
Failed to load unicharset from 'dan_frak.unicharset'
Combining tessdata files
Error: traineddata file must contain at least (a unicharset fileand inttemp) OR an lstm file.
Error combining tessdata files into dan_frak.traineddata

Shreeshrii commented 7 years ago

@tokee

buildscript.sh is set up for training with 3.0x. For 4.0 training you have to use the tesstrain.sh script at https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh

The LSTM training process has so far only been described for synthetic images created by the text2image program, not for pre-existing box/tiff pairs.

You can take a look at frk.traineddata and langdata.

Shreeshrii commented 7 years ago

If you want to use these box/tiff pairs, you will need to modify the box files, adding a new box with a tab character as its symbol at the end of each text line.

See the attached file as a sample: frk.embedsiver.exp0.box.txt

Shreeshrii commented 7 years ago

You can use a box editor, such as jTessBoxEditor, to do so.
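
For anyone who would rather script the box-file change than do it by hand, here is a minimal Python sketch. It rests on several assumptions that are not confirmed in this thread: that each box line has the form `symbol left bottom right top page`, that a new text line starts whenever the next box's left edge is smaller than the current one's (or the page number changes), and that a one-pixel-wide box just after the last glyph is an acceptable position for the tab box. The function names are hypothetical. Compare the output against the attached sample file before training with it.

```python
def parse_box(line):
    """Split one box-file line into (symbol, left, bottom, right, top, page).

    Assumes the usual 'symbol left bottom right top page' layout.
    """
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def add_line_end_boxes(lines):
    """Return box lines with a tab box appended after each detected text line.

    Heuristic (an assumption, not Tesseract's own logic): a text line ends
    when the next box starts further left than the current box, when the
    page number changes, or when the file ends.
    """
    boxes = [parse_box(line) for line in lines if line.strip()]
    out = []
    for i, (symbol, left, bottom, right, top, page) in enumerate(boxes):
        out.append(f"{symbol} {left} {bottom} {right} {top} {page}")
        nxt = boxes[i + 1] if i + 1 < len(boxes) else None
        if nxt is None or nxt[5] != page or nxt[1] < left:
            # Tab box placed right after the last glyph (assumed position).
            out.append(f"\t {right} {bottom} {right + 1} {top} {page}")
    return out


if __name__ == "__main__":
    sample = [
        "a 10 100 20 120 0",
        "b 22 100 30 120 0",
        "c 10 60 20 80 0",
    ]
    for row in add_line_end_boxes(sample):
        print(repr(row))
```

The left-edge heuristic can misfire on overlapping diacritics or multi-column pages, so a visual check in a box editor is still worthwhile.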

tokee commented 7 years ago

Just a heads-up: Thank you for your help. Work dictates that I spend the next week on other things, but I'll get back to tesseract after that.

AviFix commented 5 years ago

Hi,

Any update?

How can I use the files with the latest tesseract version?

tokee commented 5 years ago

Sorry, my priorities were shifted. OCR is now "sometime later this year". No guarantee they won't be shifted again.

paalberti commented 5 years ago

AviFix, in case you are not aware of it: you can use the traineddata files generated with tesseract 3 just fine with tesseract 4, so you are not left completely in the dark.

Resolving this issue properly means running the training process with the latest OCR engine (LSTM), which means starting over with a new set of files and a different approach. I have also been intending to look at this, but I'm not actively working on it and can't offer any timeline.

Shreeshrii commented 5 years ago

"you can use the traineddata files generated with tesseract 3 just fine with tesseract 4"

Use --oem 0 for that, which selects the legacy (non-LSTM) engine.
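
As an illustration of the `--oem 0` advice, the following Python sketch builds a tesseract command line that selects the legacy engine. The file names `page.tif` and `page`, the helper names, and the assumption that `dan_frak.traineddata` from this repository is installed in the tessdata directory are all placeholders for illustration.

```python
import subprocess


def legacy_ocr_command(image, output_base, lang="dan_frak", tessdata_dir=None):
    """Build a tesseract command that forces the legacy engine (--oem 0)."""
    cmd = ["tesseract", image, output_base, "-l", lang, "--oem", "0"]
    if tessdata_dir:
        # Optional: point tesseract at a directory holding the 3.0x traineddata.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_legacy_ocr(image, output_base, **kwargs):
    """Run the command; requires tesseract 4 to be installed and on PATH."""
    return subprocess.run(legacy_ocr_command(image, output_base, **kwargs), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_legacy_ocr(...) to execute it.
    print(" ".join(legacy_ocr_command("page.tif", "page")))
```

Without `--oem 0`, tesseract 4 may try to use the LSTM engine, for which a traineddata file built with 3.0x contains no model.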