patcharats / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

[ 1546972 ] Tesseract crashed in edge_char_of at dawg.cpp:56 #13

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Tesseract crashed on a specific file. After rebuilding
it with --enable-debug I ran gdb on it:

Starting program: /tmp/tesseract-1.0/tesseract test.tif test batch
Reading symbols from shared object read from target memory...done.
Loaded system supplied DSO at 0x4f0a4000
Tesseract Open Source OCR Engine

Program received signal SIGSEGV, Segmentation fault.
0x08102a8e in edge_char_of (dawg=0xb7f3d008, node=143000, character=105, word_end=0) at dawg.cpp:56
56 if (edge_occupied (dawg, edge)) {
(gdb) bt
#0 0x08102a8e in edge_char_of (dawg=0xb7f3d008, node=143000, character=105, word_end=0) at dawg.cpp:56
#1 0x08102f56 in letter_is_okay (dawg=0xb7f3d008, node=0xbfdf8cf4, char_index=3, prevchar=0 '\0', word=0xbfdf8f6b "DudI", word_end=0) at dawg.cpp:145
#2 0x080fa781 in append_next_choice (dawg=0xb7f3d008, node=143000, permuter=5 '\005', word=0xbfdf8f6b "DudI", choices=0x8d67360, char_index=3, this_choice=0x8d3e448, prevchar=0 '\0', limit=0xbfdf8f94, rating=8.22761822, certainty=-2.4420526, rating_array=0xbfdf8e20, certainty_array=0xbfdf8ec4, word_ending=0, last_word=0, result=0xbfdf8d84) at permdawg.cpp:188
#3 0x080fabaf in dawg_permute (dawg=0xb7f3d008, node=143000, permuter=5 '\005', choices=0x8d67360, char_index=3, limit=0xbfdf8f94, word=0xbfdf8f6b "DudI", rating=0, certainty=0, rating_array=0xbfdf8e20, certainty_array=0xbfdf8ec4, last_word=0) at permdawg.cpp:256
#4 0x080fad82 in dawg_permute_and_select (string=0x815fade "system words:", dawg=0xb7f3d008, permuter=5 '\005', character_choices=0x8d67360, best_choice=0x8d3e498, system_words=1) at permdawg.cpp:306
#5 0x080fc522 in permute_words (char_choices=0x8d67360, rating_limit=1000) at permute.cpp:1542
#6 0x080fda0f in permute_all (char_choices=0x8d67360, rating_limit=1000, raw_choice=0xbfdf91bc) at permute.cpp:1046
#7 0x080fdfc2 in permute_characters (char_choices=0x8d67360, limit=1000, best_choice=0xbfdf91cc, raw_choice=0xbfdf91bc) at permute.cpp:1099
#8 0x080d95bd in chop_word_main (word=0x8d2ea28, fx=1, best_choice=0xbfdf91cc, raw_choice=0xbfdf91bc, tester=0 '\0', trainer=0 '\0') at chopper.cpp:436
#9 0x080d744d in cc_recog (tessword=0x8d2ea28, best_choice=0xbfdf91cc, best_raw_choice=0xbfdf91bc, tester=0 '\0', trainer=0 '\0') at tface.cpp:242
#10 0x08070920 in recog_word_recursive (word=0x8d35a78, denorm=0x8d2e964, matcher=0x806f860 <tess_default_matcher(PBLOB*, PBLOB*, PBLOB*, WERD*, DENORM*, BLOB_CHOICE_LIST&)>, tester=0, trainer=0, testing=0 '\0', raw_choice=@0x8d2e98c, blob_choices=0xbfdf9308, outword=@0x8d2e960) at tfacepp.cpp:165
#11 0x080712e2 in recog_word (word=0x8d35a78, denorm=0x8d2e964, matcher=0x806f860 <tess_default_matcher(PBLOB*, PBLOB*, PBLOB*, WERD*, DENORM*, BLOB_CHOICE_LIST&)>, tester=0, trainer=0, testing=0 '\0', raw_choice=@0x8d2e98c, blob_choices=0xbfdf9308, outword=@0x8d2e960) at tfacepp.cpp:74
#12 0x0806fc59 in tess_segment_pass2 (word=0x8d35a78, denorm=0x8d2e964, matcher=0x806f860 <tess_default_matcher(PBLOB*, PBLOB*, PBLOB*, WERD*, DENORM*, BLOB_CHOICE_LIST&)>, raw_choice=@0x8d2e98c, blob_choices=0xbfdf9308, outword=@0x8d2e960) at tessbox.cpp:95
#13 0x08053ba4 in match_word_pass2 (word=0x8d2e958, row=0x8c1ea50, x_height=22) at control.cpp:859
#14 0x080542f3 in classify_word_pass2 (word=0x8d2e958, row=0x8c1ea50) at control.cpp:663
#15 0x08055bd6 in recog_all_words (page_res=0xbfdf95a4, monitor=0x0) at control.cpp:355
#16 0x0804bb6c in recognize_page (image_name=@0xbfdf95fc) at tessedit.cpp:159
#17 0x0804a9eb in main (argc=4, argv=0xbfdf96b4) at tesseractmain.cpp:93

I reduced the .tif to contain only the words that seem
to cause the crash.

Comments

Date: 2007-01-11 20:16
Sender: filipg
Logged In: YES 
user_id=37894
Originator: NO

Can't attach files here, so I put them under item 1633726 in
Tracker->Patches.

Hope this helps, Mr. Smith :-) The bug seems to be real and will likely
show up again when tesseract gains a wider audience, i.e., it will need to
be tracked down and squashed, but since it's in DAWG and its ilk, I won't
be its squasher :-)

Cheers,
File

Date: 2007-01-11 20:09
Sender: filipg
Logged In: YES 
user_id=37894
Originator: NO

Clarification: see the attached file "DUDLEY_fault.txt" for an explanation
of 1, 2, and 3. Quickly, the numbers:

1 = B, D, E, [G - J], [L - P], R, U, W, Z
2 = [B - E], G, H, J, [L - R], U, [W - Z]
3 = [A - Z]

These numbers refer to the positions where the letters were placed to
trigger the fault.

+------+                   +------+
| 121- |---+--+            | Byb- |
| 2 3  |-----------+       | y Q  | 
+------+   |  |    |       +------+
           v  v    v          ^
           1  2    3          |
Faults for B, E, & Q:         |
Faults for B, G, & Q:         |
Faults for B, H, & Q:         |
Faults for B, P, & Q:         |
Faults for B, Q, & Q:         |
Faults for B, Y, & Q:---------+ for example
[...]

Date: 2007-01-11 20:06
Sender: filipg
Logged In: YES 
user_id=37894
Originator: NO

The submitter's test.tif contains exactly:
+----------------------------+
|                        Dud-|
|ley Observatory             |
+----------------------------+
and sure enough it crashes. However, this problem can be reduced to just
five letters: three letters followed by a hyphen on the first line, then a
fourth letter, a space, and a fifth letter on the second line.

This is a puzzling fault: it's triggered only by some combinations of
letters, and case matters in an equally odd way. Case does NOT matter for
combinations that don't trigger the fault (ex: no case-variation of A, B, &
K crashes) but it DOES matter for letters that do crash (ex: the ONLY
combinations of B, E, & Q that DID crash were: beb E q, beb E Q, beB e q,
beB E Q, bEb e q, bEb E q, bEb E Q, bEB E Q, Beb e Q, BEb e q, & BEB E Q).
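
For anyone wanting to repeat the case experiment, here is a minimal sketch
(not the script actually used) that enumerates every upper/lower-case
spelling of one letter triple in the "121 2 3" layout. The function name
case_variants is made up for illustration.

```python
from itertools import product


def case_variants(l1, l2, l3):
    """Return every upper/lower-case spelling of the five-letter pattern.

    The pattern is letter1, letter2, letter1 on the first line, then
    letter2 and letter3 on the second line, written here as 'xyx y z'.
    """
    base = [l1, l2, l1, l2, l3]
    variants = []
    # Each of the five positions is independently lower- or upper-cased.
    for casing in product((str.lower, str.upper), repeat=5):
        a, b, c, d, e = (fn(ch) for fn, ch in zip(casing, base))
        variants.append(f"{a}{b}{c} {d} {e}")
    return variants


variants = case_variants("b", "e", "q")
print(len(variants))  # 2**5 = 32 spellings
```

The 11 crashing spellings of B, E, & Q listed above are a subset of these
32 variants.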

Noting that the combination "Beb e Q" matches the provided test.tif, I let
my PC do some crunching (3 nested for loops from A to Z running each
combination through tesseract :-) and the following letter-combinations
cause tesseract to crash:

1 = B, D, E, [G - J], [L - P], R, U, W, Z
2 = [B - E], G, H, J, [L - R], U, [W - Z]
3 = [A - Z]

I attached three files: a) the partial set that causes faults, b) a gdb
trace of trigger.txt which contains exactly:
+------+
| Byb- |
| y Q  |
+------+
(Created trigger.tiff with: "cat trigger.txt | pbmtext -font
testing/2helvR18.bdf | pgmtopbm | pnmtotiff > trigger.tiff"), and c) the
trigger.tiff itself.
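
A rough sketch of how the brute-force search could be reproduced, assuming
the netpbm tools and a tesseract binary are on the PATH. The function names
are invented for illustration, the tesseract arguments mirror the
"tesseract test.tif test batch" invocation from the original report, and
the exact whitespace of the real trigger.txt is not known, so no padding is
added here.

```python
import string
import subprocess
from itertools import product

FONT = "testing/2helvR18.bdf"  # font path quoted in the comment above


def trigger_text(l1, l2, l3):
    """Build the two-line text in the 'Beb e Q' case pattern.

    Line 1: letter1 (upper), letter2 (lower), letter1 (lower), hyphen.
    Line 2: letter2 (lower), space, letter3 (upper).
    """
    return f"{l1.upper()}{l2.lower()}{l1.lower()}-\n{l2.lower()} {l3.upper()}\n"


def crashes(text, tiff="trigger.tiff"):
    """Render text to a TIFF with netpbm, run tesseract, report a crash.

    Assumption: a crash shows up as the process being killed by a signal,
    which subprocess reports as a negative return code on POSIX systems.
    """
    pipeline = f"pbmtext -font {FONT} | pgmtopbm | pnmtotiff > {tiff}"
    subprocess.run(pipeline, shell=True, input=text.encode(), check=True)
    result = subprocess.run(["tesseract", tiff, "out", "batch"])
    return result.returncode < 0


def find_crashing_triples():
    """Three nested A-to-Z loops, expressed as one product over triples."""
    triples = product(string.ascii_uppercase, repeat=3)
    return [t for t in triples if crashes(trigger_text(*t))]
```

Calling find_crashing_triples() runs all 26 * 26 * 26 = 17576 combinations,
so expect it to take a while.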

In my opinion, this is a logic fault or programming error. My hardware is
a speedy Athlon under Fedora Core 6 (stock) - nothing fancy.

Original issue reported on code.google.com by tmb...@gmail.com on 7 Mar 2007 at 10:31

GoogleCodeExporter commented 9 years ago

Original comment by tmb...@gmail.com on 7 Mar 2007 at 10:38

GoogleCodeExporter commented 9 years ago
I've had a similar problem with a few words:

Monsei-
gneur

(it still crashes even when it misinterprets the g as an m)

muske-
teer's

of harmo-
nies
(mis-interpreted as ofharmon

Original comment by thel...@gmail.com on 22 Oct 2008 at 4:14

GoogleCodeExporter commented 9 years ago

Original comment by theraysm...@gmail.com on 24 Dec 2008 at 1:05