core dump upon recognizing untrained quote character

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

All files are in http://groups.google.com/group/tesseract-ocr/files

Use the training data in fraktur.tgz.
Use the image in testpageBefreiung.tif.

tsqali >> tesseract testpageBefreiung.tif testpageBefreiung -l deu-f
Tesseract Open Source OCR Engine
Bad unichar_repr: '"', length: 1,1
tesseract: unicharset.cpp:70: const UNICHAR_ID
UNICHARSET::unichar_to_id(const char*, int) const: Assertion
`ids.contains(unichar_repr, length)' failed.
Aborted (core dumped)
tsqali >> gdb /usr/local/bin/tesseract core 
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...Using host libthread_db
library "/lib/tls/i686/cmov/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /usr/lib/libtiff.so.4...done.
Loaded symbols for /usr/lib/libtiff.so.4
Reading symbols from /usr/lib/libstdc++.so.6...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/tls/i686/cmov/libm.so.6...done.
Loaded symbols for /lib/tls/i686/cmov/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/tls/i686/cmov/libc.so.6...done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /usr/lib/libjpeg.so.62...done.
Loaded symbols for /usr/lib/libjpeg.so.62
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Core was generated by `tesseract testpageBefreiung.tif testpageBefreiung -l
deu-f'.
Program terminated with signal 6, Aborted.
#0  0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7c78770 in raise () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7c79ef3 in abort () from /lib/tls/i686/cmov/libc.so.6
#3  0xb7c71dbb in __assert_fail () from /lib/tls/i686/cmov/libc.so.6
#4  0x0812d2ef in UNICHARSET::unichar_to_id (this=0x817b42c,
unichar_repr=0x8343437 "\"", length=1) at unicharset.cpp:70
#5  0x0806480a in flip_0O (word=0x8273fa8) at ../ccutil/unicharset.h:165
#6  0x080694f5 in make_reject_map (word=0x8273fa8, blob_choices=0xbfafb674,
row=0x8274680, pass=1) at reject.cpp:344
#7  0x08051150 in classify_word_pass1 (word=0x8273fa8, row=0x8274680,
cluster_adapt=0 '\0', char_clusters=0x0, chars_waiting=0x0)
    at control.cpp:674
#8  0x08053247 in recog_all_words (page_res=0x8271170, monitor=0x0,
target_word_box=0x0, dopasses=0) at control.cpp:308
#9  0x0804ac31 in TessBaseAPI::Recognize (block_list=0xbfafbb5c,
monitor=0x0) at baseapi.cpp:433
#10 0x0804b83e in TessBaseAPI::RecognizeToString () at baseapi.cpp:406
#11 0x0804a7ad in main (argc=5, argv=0xbfafbcc4) at tesseractmain.cpp:157
(gdb) quit
tsqali >> 

What is the expected output? What do you see instead?

The program should generate testpageBefreiung.txt and not dump core.

What version of the product are you using? On what operating system?

I'm using tesseract-ocr 2.01 with this patch:
http://baqaqi.chi.il.us/buecher/tesseract/patches/unicharset_debug.patch

tsqali >> lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 6.10
Release:        6.10
Codename:       edgy
tsqali >> uname -a
Linux tsqali 2.6.17-12-generic #2 SMP Mon Jul 16 19:37:58 UTC 2007 i686
GNU/Linux

Please provide any additional information below.

The ASCII double quote character is not in the training set. Fraktur uses
left and right guillemets rather than double quotes, and both guillemets
have many training examples.

I believe that we are blowing up on the first right guillemet on line 8.

Original issue reported on code.google.com by piggy.ya...@gmail.com on 1 Sep 2007 at 12:33

GoogleCodeExporter commented 9 years ago

Regarding testpageBefreiung.tif: the language appears to be German (deu). If so,
please upload a few lines of typed German text so that I can test on MS Windows.
If it works on MS Windows, it should also work on Ubuntu.

Original comment by withbles...@gmail.com on 2 Sep 2007 at 12:11

GoogleCodeExporter commented 9 years ago

There is a simple fix for this. A patch will be issued soon.

Original comment by theraysm...@gmail.com on 6 Sep 2007 at 12:26

Changed state: Started

GoogleCodeExporter commented 9 years ago

Here is a patch that solves the problem the wrong way. It disables double quote
detection. The right solution would involve not barfing if the recognizer
generates characters which are not in the training set. I don't immediately see
how to do that.

Original comment by piggy.ya...@gmail.com on 2 Dec 2007 at 4:04

Attachments:

disable_double_quote.patch
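
One way to read the "right solution" described in the comment above is to check whether a character is in the trained set before asking for its id, and to skip any post-processing step that depends on a missing character. The following is only a minimal sketch of that idea. `ToyUnicharset`, `kInvalidId`, `to_id_or_invalid` and `can_apply_quote_fixup` are hypothetical names made up for illustration; this is not Tesseract's `UNICHARSET` API and not necessarily the fix that was eventually released.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sentinel meaning "this character was never trained".
constexpr int kInvalidId = -1;

// Hypothetical stand-in for a trained character set.
// This is NOT Tesseract's UNICHARSET; it only illustrates the guard.
class ToyUnicharset {
 public:
  // Registers a character representation and returns its id.
  // Re-inserting an existing representation returns the existing id.
  int insert(const std::string& repr) {
    assert(!repr.empty());  // training input must not be empty
    auto it = ids_.find(repr);
    if (it != ids_.end()) return it->second;
    const int id = static_cast<int>(ids_.size());
    ids_.emplace(repr, id);
    return id;
  }

  // True if the representation was seen during training.
  bool contains(const std::string& repr) const {
    return ids_.count(repr) != 0;
  }

  // Non-aborting lookup: returns kInvalidId instead of asserting
  // when the representation is unknown.
  int to_id_or_invalid(const std::string& repr) const {
    auto it = ids_.find(repr);
    if (it == ids_.end()) return kInvalidId;
    return it->second;
  }

 private:
  std::map<std::string, int> ids_;
};

// Hypothetical guard for a post-processing step that needs the ASCII
// double quote. Returns false when the step should be skipped because
// the character is absent from the trained set.
bool can_apply_quote_fixup(const ToyUnicharset& set) {
  return set.contains("\"");
}
```

With a set trained only on the guillemets, `can_apply_quote_fixup` returns false and `to_id_or_invalid("\"")` returns `kInvalidId`, so a caller in the position of `flip_0O` (frame #5 of the backtrace) could skip the quote handling instead of hitting the assertion.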

GoogleCodeExporter commented 9 years ago

This was fixed in a previous release.

Original comment by theraysm...@gmail.com on 28 Dec 2008 at 6:45

Changed state: Fixed

patcharats / tesseract-ocr

core dump upon recognizing untrained quote character #62