tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Assertion when running lstmtraining #757

Open stweil opened 7 years ago

stweil commented 7 years ago

Running lstmtraining for frk language with 50000 iterations terminated with an assertion.

$ lstmtraining -U /home/stweil/src/github/tesseract-ocr/tesseract/frk/train/frk.unicharset --script_dir ~/src/github/tesseract-ocr/langdata --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]' --model_output /home/stweil/src/github/tesseract-ocr/tesseract/frk/output/base --train_listfile /home/stweil/src/github/tesseract-ocr/tesseract/frk/train/frk.training_files.txt --eval_listfile /home/stweil/src/github/tesseract-ocr/tesseract/frk/train/frk.training_files.txt --max_iterations 50000
...
At iteration 15778/49900/49900, Mean rms=0.37%, delta=0.112%, char train=0.381%, word train=1.506%, skip ratio=0%,  wrote checkpoint.

At iteration 15788/50000/50000, Mean rms=0.363%, delta=0.104%, char train=0.346%, word train=1.387%, skip ratio=0%,  wrote checkpoint.

Finished! Error rate = 0.26
num_docs > 0:Error:Assert failed:in file ../../../../ccstruct/imagedata.cpp, line 648

I used latest Tesseract sources, a slightly modified font list and a longer training text for frk training. A previous run with 10000 iterations and nearly the same conditions did not raise the assertion:

...
2 Percent improvement time=807, best error was 3.911 @ 8211
At iteration 9018/10000/10000, Mean rms=0.835%, delta=0.465%, char train=1.729%, word train=6.095%, skip ratio=0%,  New best char error = 1.729Deserialize failed wrote best model:/home/stweil/src/github/tesseract-ocr/tesseract/tutorial/frkoutput/base1.729_9018.lstm wrote checkpoint.

Finished! Error rate = 1.729

Shreeshrii commented 7 years ago

I have found that the error goes away when NOT using --eval_listfile, please try without the following

--eval_listfile /home/stweil/src/github/tesseract-ocr/tesseract/frk/train/frk.training_files.txt

Though this means that there is no regular eval during training.
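
One way to keep some evaluation while training without --eval_listfile is to run the separate lstmeval tool on the written model files afterwards. The following is only a minimal sketch: it assumes lstmeval accepts --model and --eval_listfile (newer versions may additionally need --traineddata), and the function name eval_checkpoints and the LSTMEVAL override are hypothetical names introduced for illustration.

```shell
# Evaluate each model file written by lstmtraining, one after another.
# LSTMEVAL may be overridden for a dry run; it defaults to lstmeval.
eval_checkpoints() {
  prefix=$1     # the value passed to --model_output
  listfile=$2   # the list file formerly passed to --eval_listfile
  for model in "$prefix"_checkpoint "$prefix"*.lstm; do
    [ -e "$model" ] || continue   # skip patterns that matched nothing
    echo "== $model"
    "${LSTMEVAL:-lstmeval}" --model "$model" --eval_listfile "$listfile"
  done
}
```

With the paths from the report above, the call would be eval_checkpoints followed by the --model_output prefix and the training list file. Setting LSTMEVAL=echo first prints the commands instead of running them.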

amitdo commented 7 years ago

A link to the relevant line in the source code:
https://github.com/tesseract-ocr/tesseract/blob/134a2537584b5bd6000841dbcb0a9489cd2548f5/ccstruct/imagedata.cpp#L648

amitdo commented 7 years ago

Maybe it's a memory problem. How much RAM do you have in that machine?

amitdo commented 7 years ago

About fonts, I believe that for LSTM training Ray used many more fonts for each language than with the old engine.

stweil commented 7 years ago

The training machine has 23 GiB RAM plus 23 GiB swap, and about 35 GiB of that memory is available for the training. Maybe some Debian GNU/Linux defaults set a smaller limit for single processes, but I don't think we have a memory problem.

stweil commented 7 years ago

According to font_properties, Ray used about 6000 fonts. I used 12 fonts. The training result was pretty good for the old engine and unusable for LSTM.

amitdo commented 7 years ago

So little RAM? A University server, I guess... :-)

stweil commented 7 years ago

A server which wants to process 700000 pages of a journal printed in fraktur.

stweil commented 7 years ago

I have found that the error goes away when NOT using --eval_listfile, please try without the following ...

Yes, the assertion does not occur when I omit --eval_listfile.

Shreeshrii commented 7 years ago

On Mon, Mar 13, 2017 at 3:42 PM, Amit D. wrote:

I used 12 fonts. The training result was pretty good for the old engine and unusable for LSTM.

12 fonts are not enough for LSTM. Use as many fonts as you can find.

@stweil You can increase the number of box/tiff pairs by adding --exposures "-1 0 1" or even --exposures "-2 -1 0 1 2" with the same fonts to get images which are lighter and darker than the original font.

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang frk \
  --linedata_only --noextract_font_properties --exposures "-1 0 1" \
  --langdata_dir ../langdata --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/frk

Also, please check whether the fonts you are using have support for the paragraph marker etc, otherwise they might get dropped as unrenderable.

@theraysmith I think it will be useful if training using non-synthetic box/tiff pairs is also supported for LSTM.

Thanks.

Shreeshrii commented 7 years ago

@stweil I had generated box files for different Fraktur font alphabet images using makebox. However these need to be reviewed for correctness and tabs need to be added at end of lines.

I do not recognize the letters so can't update them. jtessboxeditor could be used for adding tabs.

https://github.com/paalberti/tesseract-dan-fraktur/files/721936/fraktur-png-box-to-be-corrected.zip

stweil commented 7 years ago

@Shreeshrii, that's a nice collection of Fraktur fonts, but several of the images do not even include all normal ASCII characters. All images are missing the long s character (ſ), which is very important for all Fraktur texts. Also missing are all forms of ligatures (combinations of certain characters, for example ffi, which need a special rendering).
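
Whether an installed font claims to cover a character such as the long s can be checked before any training images are rendered. The following is a minimal sketch, assuming fontconfig's fc-list is installed and the fonts are registered with fontconfig. The function name fonts_with_codepoint and the FC_LIST override are hypothetical names introduced for illustration.

```shell
# List installed font families that claim to cover a Unicode code point.
# FC_LIST may be overridden; it defaults to fontconfig's fc-list.
fonts_with_codepoint() {
  codepoint=$1   # hexadecimal, e.g. 17f for U+017F (long s)
  "${FC_LIST:-fc-list}" ":charset=$codepoint" family | sort -u
}
```

Calling fonts_with_codepoint 17f lists the candidate families for the long s. This only reflects the font's declared character set; ligature rendering still has to be checked visually.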

Shreeshrii commented 7 years ago

I know. I do not have those fonts, but found these images on the net on some font sites. If you or Ray have the resources to get these fonts, you can use them to create appropriate training data.

Shreeshrii commented 7 years ago

These are the freely available Fraktur fonts that I found:

FRAKTUR_FONTS=(
  "CaslonishFraxx Medium"
  "Cloister Black, Semi-Light"
  "Proclamate Light, Semi-Light"
  "UnifrakturCook"
  "UnifrakturMaguntia"
  "UnifrakturMaguntia16"
  "UnifrakturMaguntia17"
  "UnifrakturMaguntia18"
  "UnifrakturMaguntia19"
  "UnifrakturMaguntia20"
  "UnifrakturMaguntia21"
  "Walbaum-Fraktur"
)

Sample files generated via text2image, which I had posted in the issue:

https://github.com/paalberti/tesseract-dan-fraktur/files/721956/frk.box-tif-pairs.zip

stweil commented 7 years ago

Did you see the new page about fonts which I added to the wiki? Maybe you want to add information there.

Shreeshrii commented 7 years ago

@stweil The page about fonts will be very useful. I have added info about Devanagari fonts and will update more later.

Please see pages 2-9 in http://www.sanskritweb.net/fontdocs/genzmer.pdf, which show samples of many Fraktur fonts. Again, they lack ligatures and some letters, but would something like that be helpful for training/testing German Fraktur?

stweil commented 7 years ago

Many thanks, that's a really very useful document which might allow us to find the exact list of Fraktur fonts used for the German newspaper editions printed from 1900 up to 1945.

Shreeshrii commented 7 years ago

You can also check the following, as well as other font-related documents on the website by Ulrich Stiehl:

http://www.sanskritweb.net/fontdocs/gutenberg2.pdf
http://www.sanskritweb.net/fontdocs/gutenberg.pdf
http://www.sanskritweb.net/fontdocs/walbaum.pdf

amitdo commented 7 years ago

INT_PARAM_FLAG(max_image_MB, 6000, "Max memory to use for images.")

I don't know if it is actually related to the reported issue, but you can increase the default value from the command line.

Shreeshrii commented 7 years ago

@amitdo It is probably memory related, since the assertion does not occur when I omit --eval_listfile.

And, https://github.com/tesseract-ocr/tesseract/blob/master/training/lstmeval.cpp uses a smaller memory size for images.

INT_PARAM_FLAG(max_image_MB, 2000, "Max memory to use for images.");

How would you change it from commandline?

amitdo commented 7 years ago

The same way you do it with text2image.

You should take into account the RAM in your PC.
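
Flags declared with INT_PARAM_FLAG are passed as --name value, so the limit can be set with --max_image_MB on the lstmtraining command line, as a later command in this thread does. The following is a minimal sketch for choosing a value from the machine's RAM, assuming a Linux host with /proc/meminfo. The helper name suggest_max_image_mb and the one-quarter fraction are hypothetical choices, not a recommendation from the Tesseract project.

```shell
# Suggest a --max_image_MB value as a fraction of physical RAM (Linux only).
suggest_max_image_mb() {
  meminfo=${1:-/proc/meminfo}
  # MemTotal is reported in kB; convert to MB and take one quarter.
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 4 }' "$meminfo"
}

# Usage (not executed here), combined with the other training flags:
#   lstmtraining --max_image_MB "$(suggest_max_image_mb)" ...
```

Because evaluation runs alongside training when --eval_listfile is given, a conservative fraction seems safer than the full amount of RAM.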

xlight commented 7 years ago

@Shreeshrii I changed max_image_MB to 8000 in training/lstmeval.cpp, compiled it, and then ran:

lstmtraining --debug_interval -1 --model_output /data/docker-tess/output/realR5 --continue_from /data/docker-tess/output/realR410.378_33695.lstm --train_listfile /data/traindata/trainAll.train_filelist.txt --eval_listfile /data/correctedBox/real.eval_filelist.txt

After 100 iterations, it dumped core:

Mean rms=0.148%, delta=2.857%, train=12.187%(29%), skip ratio=3%
lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
Aborted (core dumped)

Shreeshrii commented 7 years ago

@stweil I got the same error now. Though the same files and commands worked before and after.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
2 Percent improvement time=1100, best error was 100 @ 0
At iteration 1100/1100/1100, Mean rms=0.821%, delta=44.564%, char train=99.966%, word train=100%, skip ratio=0%,  New best char error = 99.966 wrote checkpoint.
Finished! Error rate = 99.966
num_docs > 0:Error:Assert failed:in file imagedata.cpp, line 650
./4runtesseract.sh: line 16:  5559 Segmentation fault      (core dumped) lstmtraining --script_dir ./tess4training-save -U ./tess4training-save/bih.unicharset --continue_from ./tess4training-save/bih.lstm --train_listfile ./tess4training-save/bih.training_files.txt --eval_listfile ./tess4training-save/bih.eval_files.txt --model_output ./tess4training-save/bihlayer --append_index 5 --net_spec '[Lfx384 O1c105]' --debug_interval 0 --perfect_sample_delay 19 --max_iterations 1000

Ref: https://travis-ci.org/Shreeshrii/tess4train/builds/249914589

hanikh commented 7 years ago

@Shreeshrii I am trying to fine-tune Tesseract for Arabic and Persian. I used 4000 text lines and about 40 fonts, and I set max-error-rate=0.001. An error rate of 0.002 was recorded, but after the training process finished I got error-rate=0! Is that reasonable?

theraysmith commented 7 years ago

It would be expected to get such a low error rate on your training set, but has it overfitted? How does it do on different test data?

hanikh commented 7 years ago

Dear Mr. Smith, thanks for answering my question. No, it is not overfitted. I tested it and the results were acceptable.

Shreeshrii commented 6 years ago

@stweil Is this issue with training still there?

ghost commented 6 years ago

@stweil @xlight Any updates regarding this issue? Is it caused by lstmeval.cpp or lstmtraining.cpp?

stweil commented 6 years ago

Is this issue with training still there?

I don't know, simply because I have not run training for a while now. Does anybody still get that assertion?

I just tried to reproduce the problem (with an updated command sequence) and got a different problem (integer overflow).

Shreeshrii commented 6 years ago

At iteration 13587/20000/20007, Mean rms=0.756%, delta=1.394%, char train=4.374%, word train=12.913%, skip ratio=0%,  New worst char error = 4.374
Previous test incomplete, skipping test at iteration12474 
wrote checkpoint.

Finished! Error rate = 3.044
num_docs > 0:Error:Assert failed:in file imagedata.cpp, line 650
Command terminated by signal 11
    Command being timed: "/home/ubuntu/tesseract/src/training/lstmtraining --model_output ./plus_from_deva/plus --continue_from ./plus_from_deva/Devanagari.lstm --old_traineddata ../tessdata_best/script/Devanagari.traineddata --traineddata ./sansample/san/san.traineddata --train_listfile ./sansample/san.training_files.txt --eval_listfile ./santest/san.training_files.txt --debug_interval -1 --max_image_MB 7000 --max_iterations 20000"

Could the error be related to the following message?

Previous test incomplete, skipping test at iteration12474

Shreeshrii commented 5 years ago

Still getting the error with latest code:

At iteration 14345/15900/15902, Mean rms=2.13%, delta=15.097%, char train=30.956%, word train=45.118%, skip ratio=0%,  wrote checkpoint.

Loaded 825/825 pages (1-825) of document ./mya-eval/mya.Myanmar_Text.exp0.lstmf
At iteration 14429/16000/16002, Mean rms=2.172%, delta=15.534%, char train=31.925%, word train=45.723%, skip ratio=0%,  New worst char error = 31.925Previous test incomplete, skipping test at iteration14171 wrote checkpoint.

Finished! Error rate = 30.315
num_docs > 0:Error:Assert failed:in file imagedata.cpp, line 650
mya.sh: line 232: 15076 Trace/breakpoint trap   (core dumped) lstmtraining --model_output $trained_output_dir/layer --continue_from $bestdata_dir/$BaseLang.lstm --append_index 5 --net_spec '[Lfx192 O1c1]' --traineddata $layer_output_dir/$Lang/$Lang.traineddata --max_iterations $LayerIterations --debug_interval $DebugInterval --eval_listfile $eval_output_dir/$Lang.training_files.txt --train_listfile $layer_output_dir/$Lang.training_files.txt

When an --eval_listfile is given with lstmtraining command, eval is also run in parallel (in sequential training mode).

I think that when lstmtraining is terminated on reaching max_iterations the eval process is not being brought to a graceful close. That seems to be the cause of this assertion.

stweil commented 5 years ago

It looks like the issues #1168 and #2191 are related. They all could be caused by missing thread synchronisation from this code:

src/ccstruct/imagedata.cpp:  SVSync::StartThread(ReCachePagesFunc, this);

zdenop commented 5 years ago

IMO this should be "reworked". I already wonder why (LSTM) training, which is non-interactive, tries to start a server (e.g. #578 or #1168). The target should be that lstmtraining works with the --disable-graphics build option.

Shreeshrii commented 5 years ago

As part of training tutorial/instructions Ray has given the command to include --debug_interval 100 (see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch)

debug_interval (int, default 0): If non-zero (and positive), show visual debugging every this many iterations.

If the values of -1 or 0 (default) are used then visual debugger is not called.

zdenop commented 5 years ago

Thanks, Shree. Did you try to use --debug_interval with the build option --disable-graphics?

Shreeshrii commented 5 years ago

I usually build with --disable-graphics and use --debug_interval of 0 and -1 while running lstmtraining.

I have not tried lstmtraining with a non-zero positive number for --debug_interval.

stweil commented 5 years ago

The problem still exists. I just got it while running training with ocrd-train.

stweil commented 4 years ago

It looks like the issues #1168 and #2191 are related. They all could be caused by missing thread synchronisation.

All those problems will be triggered when the destructor DocumentData::~DocumentData() destroys the object while there is still another thread at the beginning of DocumentData::ReCachePages(). That thread will still work with the destroyed object and fail of course.

I can reproduce a SIGSEGV caused by this by running imagedata_test several times on some hosts:

cd unittest;
while ./imagedata_test; do true; done

On other hosts that test does not trigger the problem. Maybe those hosts are too fast.
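
The loop above stops at the first failure. A bounded variant that counts failing runs may make it easier to compare hosts. The following is a minimal sketch; the function name count_failures, the TEST_BIN override and the default of 100 runs are hypothetical choices introduced for illustration.

```shell
# Run a test binary a fixed number of times and report how many runs failed.
count_failures() {
  runs=${1:-100}
  test_bin=${TEST_BIN:-./imagedata_test}
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # A crash (e.g. SIGSEGV) also yields a non-zero exit status.
    "$test_bin" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs runs failed"
}
```

Run from the unittest directory, count_failures 500 prints a single summary line. A host that reports zero failures may simply be too fast to hit the race.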

stweil commented 4 years ago

Pull request #3016 hopefully fixes this.