sonurakpinar / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Segfault normmatch.cpp:118 for some images #755

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When using a .traineddata file with a large .unicharambigs file, I get 
reproducible segfaults for some images (though most work fine).

I believe it is due to the large .unicharambigs file, as removing it from the 
training file allows the tesseract process to complete successfully. By 
'large', I mean the .unicharambigs in my grc training, which is 35751 
lines long (828k).

What steps will reproduce the problem?
1. copy the attached grc.traineddata to the TESSDATA directory
2. copy the attached test.png to the current directory
3. tesseract test.png out -l grc

$ tesseract test.png out -l grc
Tesseract Open Source OCR Engine v3.02 with Leptonica
Segmentation fault (core dumped)

I'll attach the test image, training data, and gdb full backtrace. I have the 
core file, but it's 500MB. I can upload it to my webspace if it's useful.

This is with tesseract r739, and was present in earlier revisions of 3.02 as 
well.

Original issue reported on code.google.com by nick.wh...@durham.ac.uk on 13 Sep 2012 at 10:49

Attachments:

GoogleCodeExporter commented 9 years ago
For comparison I'm attaching a scan similar to the above test.png, for which 
tesseract finishes without issue. The image in question is the next page of a 
scanned book, which has gone through an identical cleanup pipeline.

Original comment by nick.wh...@durham.ac.uk on 13 Sep 2012 at 1:49

Attachments:

GoogleCodeExporter commented 9 years ago
It looks like the same bug may have also been tripped when Zdenko was fiddling 
with user-patterns; see 
http://groups.google.com/group/tesseract-ocr/msg/0488896e73a61f52

Original comment by nick.wh...@durham.ac.uk on 23 Oct 2012 at 3:35

GoogleCodeExporter commented 9 years ago
Nick,
The grc.traineddata attached to this issue causes a segfault, as you reported.
But if I use the grc.traineddata from svn (your latest version), it works for me.
Do you know what you have changed?

Is this issue still valid? If yes, can you provide an image that will produce a 
crash with the svn grc.traineddata?

Original comment by zde...@gmail.com on 9 Nov 2012 at 10:02

Attachments:

GoogleCodeExporter commented 9 years ago
Hi Zdenko,
Thanks for looking into this. There are two differences between the attached 
.traineddata and the svn one. The svn one has more .unicharambigs rules (about 
550 more), and the svn one doesn't have the 
'language_model_penalty_non_freq_dict_word' or 
'language_model_penalty_non_dict_word' lines in the .config.

I investigated and found that removing the line 
"language_model_penalty_non_dict_word 0.3" from the .config stopped the crash. 
Alternatively, entirely deleting the .unicharambigs file also stopped the crash.
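For reference, this is the .config line whose removal stopped the crash (value as given above):

```
language_model_penalty_non_dict_word 0.3
```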

So there must be some weird code bug somewhere around these. Let me know if I 
can be of any more assistance.

Original comment by nick.wh...@durham.ac.uk on 10 Nov 2012 at 5:29

GoogleCodeExporter commented 9 years ago
Can you try to find out if there is a specific rule that causes the problem?

Original comment by zde...@gmail.com on 10 Nov 2012 at 7:06

GoogleCodeExporter commented 9 years ago
I don't think there can be a single rule causing it, as the svn version has all 
of the rules in the older version (plus a few extra), but doesn't crash.

Original comment by nick.wh...@durham.ac.uk on 11 Nov 2012 at 2:19

GoogleCodeExporter commented 9 years ago
One additional finding regarding your comment #4:
When I turn on the classify debugger (classify_debug_level 1), I found that 
tesseract crashes while processing the words "codd.:" and "corr.". When I 
removed them, there was no crash (see test1.png) ;-)

So it looks like the crash is caused when there is a non-dict word. 
Interestingly, there is no crash for the word "codd." two lines above ;-)

Original comment by zde...@gmail.com on 9 Jan 2013 at 10:58

Attachments:

GoogleCodeExporter commented 9 years ago
Good news! I found the fix for the segfault, with a little help from valgrind & 
gdb :)

Attached is a patch that fixes it.

Basically, Classify::ComputeNormMatch, when called from 
Classify::ComputeIntCharNormArray, is passed a ClassId from 
PreTrainedTemplates. However, NormProtos can have a different set of 
ClassIds, so this can cause reads outside the NormProtos->Protos array. This 
patch just treats any invalid ClassIds as NO_CLASS.

I'm not exactly sure what ComputeIntCharNormArray is doing with 
PreTrainedTemplates, so it would be good to have that section looked over by 
someone more familiar with the codebase, but I'm reasonably confident this 
patch is reasonable.

Original comment by nick.wh...@durham.ac.uk on 29 Mar 2013 at 1:08

Attachments:

GoogleCodeExporter commented 9 years ago
A friend of mine (who also often hit segfaults using the Ancient Greek 
training) and I have both been happily segfault-free for the last month since 
applying the attached patch.

So someone should look it over and apply it ;)

Original comment by nick.wh...@durham.ac.uk on 1 May 2013 at 3:44

GoogleCodeExporter commented 9 years ago
Committed as r839.

Original comment by zde...@gmail.com on 2 May 2013 at 8:02

GoogleCodeExporter commented 9 years ago
Great, thanks Zdenko. It would still be good for Ray to look over it, as the 
code it concerns looks a bit dodgy to me, but it's less dodgy with my patch 
than it was before ;)

Original comment by nick.wh...@durham.ac.uk on 2 May 2013 at 9:09

GoogleCodeExporter commented 9 years ago
In my case I had to slightly modify the patch. Instead of:

  if(ClassId > NormProtos->NumProtos) {

I had to use:

  if(ClassId >= NormProtos->NumProtos) {

to prevent the segmentation fault.

I'm using 3.02.02 with a user-patterns file. Segmentation fault only happens 
when using the user-patterns file.

I also changed the variable kSaneNumConcreteChars to 0, as suggested.

In my case's segmentation fault valgrind points to:

  Protos = NormProtos->Protos[ClassId];

It seems NormProtos->NumProtos = 167 and ClassId = 167. That's why I used the 
>= instead of the >. Nevertheless, *NormProtos->Protos was NULL?

Original comment by joao.m.s...@gmail.com on 7 Oct 2013 at 11:32

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 8 Oct 2013 at 6:42

GoogleCodeExporter commented 9 years ago
Can you provide a test case (image and user-patterns file) for the segmentation fault?

Original comment by zde...@gmail.com on 9 Oct 2013 at 8:41

GoogleCodeExporter commented 9 years ago
I think it happens with any image and user-patterns file.

See the README in my testcase in attachment. I have not included 
eng.traineddata.

Original comment by joao.m.s...@gmail.com on 10 Oct 2013 at 12:31

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks joao.m.santos.silva. Fixed in r894.

I changed kSaneNumConcreteChars to 0, because the former value did not work 
with the example provided in the tesseract manpage[1].

[1] 
http://tesseract-ocr.googlecode.com/svn-history/r719/trunk/doc/tesseract.1.html

Original comment by zde...@gmail.com on 20 Oct 2013 at 8:23

GoogleCodeExporter commented 9 years ago
I am trying to use your grc.traineddata for KOReader. KOReader is an open 
source ebook reader program that runs on the newest eInk display devices from 
Kobo and Kindle. It uses Tesseract for word recognition and dictionary lookup.
Unfortunately, the use of grc.traineddata leads to segmentation faults and 
memory allocation errors (due to the use of 1GB of memory in simulation). See 
https://github.com/koreader/koreader/issues/327

I now use ell.traineddata and a stardict version I made of Liddell-Scott-Jones 
for dictionary lookup, but a lot of words aren't recognized.

Is there a way to dumb down grc.traineddata _slightly_ so that I can use it 
without the errors?

Original comment by Zulde.Zu...@gmail.com on 30 Nov 2013 at 6:41