Open stweil opened 7 years ago
I have found that the error goes away when NOT using --eval_listfile, please try without the following
--eval_listfile /home/stweil/src/github/tesseract-ocr/tesseract/frk/train/frk.training_files.txt
Though this means that there is no regular eval during training.
ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Mon, Mar 13, 2017 at 12:42 PM, Stefan Weil notifications@github.com wrote:
Running lstmtraining for frk language with 50000 iterations terminated with an assertion.
$ lstmtraining -U /home/stweil/src/github/tesseract-ocr/tesseract/frk/train/frk.unicharset --script_dir ~/src/github/tesseract-ocr/langdata --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]' --model_output /home/stweil/src/github/tesseract-ocr/tesseract/frk/output/base --train_listfile /home/stweil/src/github/tesseract-ocr/tesseract/frk/train/frk.training_files.txt --eval_listfile /home/stweil/src/github/tesseract-ocr/tesseract/frk/train/frk.training_files.txt --max_iterations 50000 ... At iteration 15778/49900/49900, Mean rms=0.37%, delta=0.112%, char train=0.381%, word train=1.506%, skip ratio=0%, wrote checkpoint.
At iteration 15788/50000/50000, Mean rms=0.363%, delta=0.104%, char train=0.346%, word train=1.387%, skip ratio=0%, wrote checkpoint.
Finished! Error rate = 0.26 num_docs > 0:Error:Assert failed:in file ../../../../ccstruct/imagedata.cpp, line 648
I used latest Tesseract sources, a slightly modified font list and a longer training text for frk training. A previous run with 10000 iterations and nearly the same conditions did not raise the assertion:
... 2 Percent improvement time=807, best error was 3.911 @ 8211 At iteration 9018/10000/10000, Mean rms=0.835%, delta=0.465%, char train=1.729%, word train=6.095%, skip ratio=0%, New best char error = 1.729Deserialize failed wrote best model:/home/stweil/src/github/tesseract-ocr/tesseract/tutorial/frkoutput/base1.729_9018.lstm wrote checkpoint.
Finished! Error rate = 1.729
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/757, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o-S-16Nb7AFsGsCO3w_L80p8t9Fuks5rlOxhgaJpZM4Ma6ed .
A link to the relevant line in the source code:
https://github.com/tesseract-ocr/tesseract/blob/134a2537584b5bd6000841dbcb0a9489cd2548f5/ccstruct/imagedata.cpp#L648
Maybe it's a memory problem. How much RAM do you have in that machine?
About fonts, I believe that for LSTM training Ray used much more fonts for each language than with the old engine.
The training machine has 23 GiB RAM plus 23 GiB swap, and about 35 GiB of that memory are available for the training. Maybe some Debian GNU Linux defaults set a smaller limit for single processes, but I don't think we have a memory problem.
According to font_properties, Ray used about 6000 fonts. I used 12 fonts. The training result was pretty good for the old engine and unusable for LSTM.
So little RAM? A University server, I guess... :-)
A server which wants to process 700000 pages of a journal printed in fraktur.
I have found that the error goes away when NOT using --eval_listfile, please try without the following ...
Yes, the assertion does not occur when I omit --eval_listfile
.
On Mon, Mar 13, 2017 at 3:42 PM, Amit D. notifications@github.com wrote:
I used 12 fonts. The training result was pretty good for the old engine and unusable for LSTM.
12 fonts is not enough for LSTM. Use as much fonts as you can find.
@stweil You can increase the number of box/tiff pairs by adding --exposures "-1 0 1" or even --exposures "-2 -1 0 1 2" with the same fonts to get images which are lighter and darker than the original font.
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang frk \
--linedata_only --noextract_font_properties --exposures "-1 0 1" \
--langdata_dir ../langdata --tessdata_dir ./tessdata \
--output_dir ~/tesstutorial/frk
Also, please check whether the fonts you are using have support for the paragraph marker etc, otherwise they might get dropped as unrenderable.
@theraysmith I think it will be useful if training using non-synthetic box/tiff pairs is also supported for LSTM.
Thanks.
@stweil I had generated box files for different Fraktur font alphabet images using makebox. However these need to be reviewed for correctness and tabs need to be added at end of lines.
I do not recognize the letters so can't update them. jtessboxeditor could be used for adding tabs.
https://github.com/paalberti/tesseract-dan-fraktur/files/721936/fraktur-png-box-to-be-corrected.zip
@Shreeshrii, that's a nice collection of Fraktur fonts, but several of the image not even include all normal ASCII characters. All images are missing the long s character (ſ) which is very important for all Fraktur texts. Also missing are all forms of ligatures (combinations of certain characters, like for example ffi, which need a special rendering).
I know. I do not have those fonts,but found these images on the net on some font sites. If you or Ray have the resources to get these fonts, you can use them to create appropriate trainingdata.
ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Tue, Mar 14, 2017 at 11:16 AM, Stefan Weil notifications@github.com wrote:
@Shreeshrii https://github.com/Shreeshrii, that's a nice collection of Fraktur fonts, but several of the image not even include all normal ASCII characters. All images are missing the long s character (ſ) which is very important for all Fraktur texts. Also missing are all forms of ligatures (combinations of certain characters, like for example ffi, which need a special rendering).
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/757#issuecomment-286328117, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o_my9wQq3XxnJtuzD2QWSpwu0Pt4ks5rlimqgaJpZM4Ma6ed .
These are the freely available Fraktur fonts that I found:
FRAKTUR_FONTS=(
"CaslonishFraxx Medium" \
"Cloister Black, Semi-Light" \
"Proclamate Light, Semi-Light" \
"UnifrakturCook" \
"UnifrakturMaguntia" \
"UnifrakturMaguntia16" \
"UnifrakturMaguntia17" \
"UnifrakturMaguntia18" \
"UnifrakturMaguntia19" \
"UnifrakturMaguntia20" \
"UnifrakturMaguntia21" \
"Walbaum-Fraktur" \
)
sample files via text2image that I had posted in the issue
https://github.com/paalberti/tesseract-dan-fraktur/files/721956/frk.box-tif-pairs.zip
Did you see the new page about fonts which I added to the wiki? Maybe you want to add information there.
@stweil The page about fonts will be very useful. I have added info about Devanagari fonts and will update more later.
Please see pages 2-9 in http://www.sanskritweb.net/fontdocs/genzmer.pdf which show samples of many fraktur fonts - again it does not have ligatures and all letters - but would something like that be helpful for training/testing German Fraktur.
Many thanks, that's a really very useful document which might allow us to find the exact list of Fraktur fonts used for the German newspaper editions printed from 1900 up to 1945.
You can also check the following as well as other font related documents on website by Ulrich Stiehl. http://www.sanskritweb.net/fontdocs/gutenberg2.pdf http://www.sanskritweb.net/fontdocs/gutenberg.pdf http://www.sanskritweb.net/fontdocs/walbaum.pdf
INT_PARAM_FLAG(max_image_MB, 6000, "Max memory to use for images.")
I don't know if it is actually related to the reported issue, but you can increase the default value from the command line.
@amitdo It is probably memory related, since the assertion does not occur when I omit --eval_listfile.
And, https://github.com/tesseract-ocr/tesseract/blob/master/training/lstmeval.cpp uses a smaller memory size for images.
INT_PARAM_FLAG(max_image_MB, 2000, "Max memory to use for images.");
How would you change it from commandline?
The same way you do it with text2image.
You should take into account the RAM in your PC.
@Shreeshrii I changed the max_image_MB to 8000 /training/lstmeval.cpp, and complie it .
then, run
lstmtraining --debug_interval -1 --model_output /data/docker-tess/output/realR5 --continue_from /data/docker-tess/output/realR410.378_33695.lstm --train_listfile /data/traindata/trainAll.train_filelist.txt --eval_listfile /data/correctedBox/real.eval_filelist.txt
after 100 Iterations, it CoreDump
Mean rms=0.148%, delta=2.857%, train=12.187%(29%), skip ratio=3%
lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
Aborted (core dumped)
@stweil I got the same error now. Though the same files and commands worked before and after.
Warning: LSTMTrainer deserialized an LSTMRecognizer!
2 Percent improvement time=1100, best error was 100 @ 0
At iteration 1100/1100/1100, Mean rms=0.821%, delta=44.564%, char train=99.966%, word train=100%, skip ratio=0%, New best char error = 99.966 wrote checkpoint.
Finished! Error rate = 99.966
num_docs > 0:Error:Assert failed:in file imagedata.cpp, line 650
./4runtesseract.sh: line 16: 5559 Segmentation fault (core dumped) lstmtraining --script_dir ./tess4training-save -U ./tess4training-save/bih.unicharset --continue_from ./tess4training-save/bih.lstm --train_listfile ./tess4training-save/bih.training_files.txt --eval_listfile ./tess4training-save/bih.eval_files.txt --model_output ./tess4training-save/bihlayer --append_index 5 --net_spec '[Lfx384 O1c105]' --debug_interval 0 --perfect_sample_delay 19 --max_iterations 1000
Ref: https://travis-ci.org/Shreeshrii/tess4train/builds/249914589
@Shreeshrii I am trying to fine-tune tesseract for Arabic and Persian. I have used 4000 text lines and about 40 fonts. and I set the max-error-rate=0.001. the error rate of 0.002 has been recorded. but after finishing of the training process I got error-rate=0! Is it reasonable?
It would be expected to get such a low error rate on your training set, but has it overfitted? How does it do on different test data?
On Tue, Aug 1, 2017 at 6:13 AM, hanikh notifications@github.com wrote:
@Shreeshrii https://github.com/shreeshrii I am trying to fine-tune tesseract for Arabic and Persian. I have used 4000 text lines and about 40 fonts. and I set the max-error-rate=0.001. the error rate of 0.002 has been recorded. but after finishing of the training process I got error-rate=0! Is it reasonable?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/757#issuecomment-319365858, or mute the thread https://github.com/notifications/unsubscribe-auth/AL056TlrhrROzIm6VOXVoiHjygnInOUGks5sTyRvgaJpZM4Ma6ed .
-- Ray.
Dear Mr.Smith; thanks for answering my question. No, it's not overfitted. I tested it and the results were acceptable.
On Fri, Aug 4, 2017 at 4:41 AM, theraysmith notifications@github.com wrote:
It would be expected to get such a low error rate on your training set, but has it overfitted? How does it do on different test data?
On Tue, Aug 1, 2017 at 6:13 AM, hanikh notifications@github.com wrote:
@Shreeshrii https://github.com/shreeshrii I am trying to fine-tune tesseract for Arabic and Persian. I have used 4000 text lines and about 40 fonts. and I set the max-error-rate=0.001. the error rate of 0.002 has been recorded. but after finishing of the training process I got error-rate=0! Is it reasonable?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/757# issuecomment-319365858, or mute the thread https://github.com/notifications/unsubscribe-auth/ AL056TlrhrROzIm6VOXVoiHjygnInOUGks5sTyRvgaJpZM4Ma6ed .
-- Ray.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/757#issuecomment-320122468, or mute the thread https://github.com/notifications/unsubscribe-auth/AZFiAaUyiObmY6Hi8q10PMUhoGYA_Wolks5sUmGugaJpZM4Ma6ed .
@stweil Is this issue with training still there?
@stweil @xlight any updates regarding this issue?
is it caused by lstmeval.cpp
or lstmtraining.cpp
?
Is this issue with training still there?
I don't know, simply because I have not run training for a while now. Has anybody still that assertion?
I just tried to reproduce the problem (with an updated command sequence) and got a different problem (integer overflow).
At iteration 13587/20000/20007, Mean rms=0.756%, delta=1.394%, char train=4.374%, word train=12.913%, skip ratio=0%, New worst char error = 4.374
Previous test incomplete, skipping test at iteration12474
wrote checkpoint.
Finished! Error rate = 3.044
num_docs > 0:Error:Assert failed:in file imagedata.cpp, line 650
Command terminated by signal 11
Command being timed: "/home/ubuntu/tesseract/src/training/lstmtraining --model_output ./plus_from_deva/plus --continue_from ./plus_from_deva/Devanagari.lstm --old_traineddata ../tessdata_best/script/Devanagari.traineddata --traineddata ./sansample/san/san.traineddata --train_listfile ./sansample/san.training_files.txt --eval_listfile ./santest/san.training_files.txt --debug_interval -1 --max_image_MB 7000 --max_iterations 20000"
Could the error be related to
Previous test incomplete, skipping test at iteration12474
Still getting the error with latest code:
At iteration 14345/15900/15902, Mean rms=2.13%, delta=15.097%, char train=30.956%, word train=45.118%, skip ratio=0%, wrote checkpoint.
Loaded 825/825 pages (1-825) of document ./mya-eval/mya.Myanmar_Text.exp0.lstmf
At iteration 14429/16000/16002, Mean rms=2.172%, delta=15.534%, char train=31.925%, word train=45.723%, skip ratio=0%, New worst char error = 31.925Previous test incomplete, skipping test at iteration14171 wrote checkpoint.
Finished! Error rate = 30.315
num_docs > 0:Error:Assert failed:in file imagedata.cpp, line 650
mya.sh: line 232: 15076 Trace/breakpoint trap (core dumped) lstmtraining --model_output $trained_output_dir/layer --continue_from $bestdata_dir/$BaseLang.lstm --append_index 5 --net_spec '[Lfx192 O1c1]' --traineddata $layer_output_dir/$Lang/$Lang.traineddata --max_iterations $LayerIterations --debug_interval $DebugInterval --eval_listfile $eval_output_dir/$Lang.training_files.txt --train_listfile $layer_output_dir/$Lang.training_files.txt
When an --eval_listfile
is given with lstmtraining
command, eval is also run in parallel (in sequential training mode).
I think that when lstmtraining
is terminated on reaching max_iterations
the eval process is not being brought to a graceful close. That seems to be the cause of this assertion.
It looks like the issues #1168 and #2191 are related. They all could be caused by missing thread synchronisation from this code:
src/ccstruct/imagedata.cpp: SVSync::StartThread(ReCachePagesFunc, this);
IMO this should be "reworked". I already wander why (lstm) training (which is non interactive) try to start server... (e.g. #578 or #1168) => target should be that lstmtraining must work with --disable-graphics option.
As part of training tutorial/instructions Ray has given the command to include --debug_interval 100
(see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch)
debug_interval | int | 0 | If non-zero (and positive) , show visual debugging every this many iterations. |
---|
If the values of -1 or 0 (default) are used then visual debugger is not called.
thanks shree. Did you try to use --debug_interval
with build option --disable-graphics
?
I usually build with --disable-graphics and use --debug_interval of 0 and -1 while running lstmtraining.
I have not tried lstmtraining with a non-zero positive number for --debug_interval.
The problem still exists. I just got it while running training with ocrd-train.
It looks like the issues #1168 and #2191 are related. They all could be caused by missing thread synchronisation.
All those problems will be triggered when the destructor DocumentData::~DocumentData()
destroys the object while there is still another thread at the beginning of DocumentData::ReCachePages()
. That thread will still work with the destroyed object and fail of course.
I can reproduce a SIGSEGV
caused by this by running imagedata_test
several times on some hosts:
cd unittest;
while ./imagedata_test; do true; done
On other hosts that test does not trigger the problem. Maybe those hosts are too fast.
Pull request #3016 hopefully fixes this.
Running lstmtraining for frk language with 50000 iterations terminated with an assertion.
I used latest Tesseract sources, a slightly modified font list and a longer training text for frk training. A previous run with 10000 iterations and nearly the same conditions did not raise the assertion: