I think I have encountered the same issue.
Different font sizes (e.g. 23px or 25px) cause no segfault.
Other fonts at 24px cause no segfault.
Generating a box file with tesseract and then training with box.train.stderr
causes no segfault.
> tesseract -v
tesseract 3.02.02
leptonica-1.69
libgif 4.1.6 : libjpeg 8d : libpng 1.6.3 : libtiff 4.0.3 : zlib 1.2.8
> gdb -ex "run xx.Tengwar_Noldor.24.tif xx.Tengwar_Noldor.24 box.train.stderr" -ex bt tesseract
[...]
Starting program: /usr/bin/tesseract xx.Tengwar_Noldor.24.tif xx.Tengwar_Noldor.24 box.train.stderr
[...]
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[...]
Found 959 good blobs.
Leaving 981 unlabelled blobs in 0 words.
12 remaining unlabelled words deleted.
TRAINING ... Font name = Tengwar_Noldor
!isnan(Feature->Params[i]):Error:Assert failed:in file /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/mf.cpp, line 78
Program received signal SIGSEGV, Segmentation fault.
ERRCODE::error (this=this@entry=0x7ffff7cd8218 <_ZL13ASSERT_FAILED>,
caller=caller@entry=0x7ffff7a6f663 "!isnan(Feature->Params[i])",
action=action@entry=ABORT,
format=format@entry=0x7ffff7a50ea7 "in file %s, line %d")
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/ccutil/errcode.cpp:87
#0 ERRCODE::error (this=this@entry=0x7ffff7cd8218 <_ZL13ASSERT_FAILED>,
caller=caller@entry=0x7ffff7a6f663 "!isnan(Feature->Params[i])",
action=action@entry=ABORT,
format=format@entry=0x7ffff7a50ea7 "in file %s, line %d")
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/ccutil/errcode.cpp:87
#1 0x00007ffff79ee318 in ExtractMicros (Blob=<optimized out>, denorm=...)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/mf.cpp:78
#2 0x00007ffff79df152 in ExtractFlexFeatures (FeatureDefs=...,
Blob=Blob@entry=0x21e59f0, denorm=...)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/flexfx.cpp:53
#3 0x00007ffff79decfe in ExtractBlobFeatures (FeatureDefs=..., denorm=...,
Blob=Blob@entry=0x21e59f0)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/extract.cpp:53
#4 0x00007ffff79d5afb in LearnBlob (FeatureDefs=..., FeatureFile=0x22098a0,
Blob=0x21e59f0, denorm=..., BlobText=0x253b418 "k 674 926 691 951 0",
FontName=0x220d0e8 "Tengwar_Noldor")
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/blobclass.cpp:109
#5 0x00007ffff79d5ca3 in LearnBlob (FeatureDefs=..., filename=...,
Blob=Blob@entry=0x21e59f0, denorm=...,
BlobText=BlobText@entry=0x253b418 "k 674 926 691 951 0")
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/blobclass.cpp:99
#6 0x00007ffff79d28e4 in tesseract::Classify::LearnPieces (this=this@entry=
0x6c7a00, filename=filename@entry=0x6dc4e8 "xx.Tengwar_Noldor.24",
start=start@entry=10, length=1, threshold=threshold@entry=0,
segmentation=segmentation@entry=tesseract::CST_WHOLE,
correct_text=0x253b418 "k 674 926 691 951 0", word=word@entry=0x22221d0)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/adaptmatch.cpp:438
#7 0x00007ffff79d46a4 in tesseract::Classify::LearnWord (
this=this@entry=0x6c7a00, filename=0x6dc4e8 "xx.Tengwar_Noldor.24",
rejmap=<optimized out>, rejmap@entry=0x0, word=word@entry=0x22221d0)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/adaptmatch.cpp:304
#8 0x00007ffff78c2fdb in tesseract::Tesseract::ApplyBoxTraining (this=
0x6c7a00, filename=..., page_res=<optimized out>)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/ccmain/applybox.cpp:791
#9 0x00007ffff78bd43d in tesseract::TessBaseAPI::Recognize (
this=this@entry=0x7fffffffd780, monitor=monitor@entry=0x0)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/api/baseapi.cpp:738
#10 0x00007ffff78beaad in tesseract::TessBaseAPI::ProcessPage (
this=this@entry=0x7fffffffd780, pix=0x6dca20,
page_index=page_index@entry=0,
filename=filename@entry=0x7fffffffdc96 "xx.Tengwar_Noldor.24.tif",
retry_config=retry_config@entry=0x0,
timeout_millisec=timeout_millisec@entry=0,
text_out=text_out@entry=0x7fffffffd730)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/api/baseapi.cpp:931
#11 0x00007ffff78beda2 in tesseract::TessBaseAPI::ProcessPages (
this=this@entry=0x7fffffffd780,
filename=filename@entry=0x7fffffffdc96 "xx.Tengwar_Noldor.24.tif",
retry_config=retry_config@entry=0x0,
timeout_millisec=timeout_millisec@entry=0,
text_out=text_out@entry=0x7fffffffd730)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/api/baseapi.cpp:846
#12 0x0000000000401ee0 in main (argc=<optimized out>, argv=0x7fffffffd928)
at /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/api/tesseractmain.cpp:183
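The assertion that fires in frame #1 checks that no parameter of an extracted feature is NaN. The sketch below is not Tesseract's code; it only illustrates the same kind of check in Python. The helper name `find_nan_params` and the sample values are hypothetical, and the last line shows one way a NaN can come out of ordinary floating-point arithmetic.

```python
import math


def find_nan_params(params):
    """Return the indices of all NaN entries in a list of feature parameters."""
    return [i for i, value in enumerate(params) if math.isnan(value)]


def has_only_valid_params(params):
    """True when no parameter is NaN, i.e. the condition the assert expects."""
    return not find_nan_params(params)


# inf - inf has no defined value, so IEEE 754 arithmetic yields NaN.
undefined = math.inf - math.inf
```

Calling `find_nan_params` on the parameters of each feature would point at the offending index instead of aborting the whole run.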
Original comment by tho...@kuehne.cn
on 2 Sep 2013 at 9:13
Issue 976 has been merged into this issue.
Original comment by zde...@gmail.com
on 2 Sep 2013 at 5:46
Original comment by zde...@gmail.com
on 2 Sep 2013 at 5:55
Thanks for the reply. However, I don't think my problem is the same as issue
894 because there isn't any old tesseract on my machine. When I ran "find /
-name libtesseract.so", only the following items were found:
/usr/local/lib/libtesseract.so
/extdisk2/tools/tesseract-ocr/api/.libs/libtesseract.so
Original comment by chenjie2...@gmail.com
on 3 Sep 2013 at 12:53
I can find only one libtesseract, so issue 894 isn't the cause.
> find / -name libtesseract*
/usr/lib64/libtesseract.so.3.0.2
/usr/lib64/libtesseract.so.3
/usr/lib64/libtesseract.so
> file /usr/lib64/libtesseract.so /usr/lib64/libtesseract.so.3 /usr/lib64/libtesseract.so.3.0.2
/usr/lib64/libtesseract.so: symbolic link to `libtesseract.so.3.0.2'
/usr/lib64/libtesseract.so.3: symbolic link to `libtesseract.so.3.0.2'
/usr/lib64/libtesseract.so.3.0.2: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped
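The same check (list every libtesseract file and see how many real binaries the names resolve to) can be sketched in Python. This is a minimal sketch using only the standard library; the function names `find_libraries` and `distinct_binaries` are made up for illustration, and the search root is whatever directory you pass in.

```python
import os


def find_libraries(root, prefix="libtesseract"):
    """Map each file under root whose name starts with prefix to its resolved path."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(prefix):
                path = os.path.join(dirpath, name)
                # realpath follows symbolic links to the file they finally point at.
                found[path] = os.path.realpath(path)
    return found


def distinct_binaries(found):
    """Return the sorted set of real files behind all names that were found."""
    return sorted(set(found.values()))
```

If `distinct_binaries(find_libraries("/usr"))` returns more than one path, two different builds of the library are installed, which is the situation issue 894 describes.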
Original comment by tho...@kuehne.cn
on 3 Sep 2013 at 7:38
Really? Did you read comment #2[1]? Did you try it?
[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=894#c2
Original comment by zde...@gmail.com
on 4 Sep 2013 at 7:12
Yes, I tried using the absolute paths and it still failed.
Please note that I've used box.train on quite a lot of images and this is the
only one that causes a segfault.
> TESSDATA_PREFIX=/usr/share/ /usr/bin/tesseract xx.Tengwar_Noldor.24.tif xx.Tengwar_Noldor.24 box.train.stderr
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[...]
Found 959 good blobs.
Leaving 981 unlabelled blobs in 0 words.
12 remaining unlabelled words deleted.
TRAINING ... Font name = Tengwar_Noldor
!isnan(Feature->Params[i]):Error:Assert failed:in file /var/tmp/portage/app-text/tesseract-3.02-r1/work/tesseract-3.02.02/classify/mf.cpp, line 78
Segmentation fault
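When many images are trained in a batch, a small wrapper can report which ones end in a signal such as SIGSEGV. The following is a rough sketch, assuming a POSIX system and that `tesseract` is on the PATH; the helper names `describe_exit` and `run_training` are hypothetical, and the command line mirrors the one shown above.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Translate a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exit status %d" % returncode


def run_training(images, tesseract="tesseract"):
    """Run box training on each image and collect a label per image."""
    results = {}
    for image in images:
        # "xx.Tengwar_Noldor.24.tif" -> output base "xx.Tengwar_Noldor.24"
        base = image.rsplit(".", 1)[0]
        proc = subprocess.run(
            [tesseract, image, base, "box.train.stderr"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[image] = describe_exit(proc.returncode)
    return results
```

Running `run_training` over a list of training images returns a dictionary in which any image that crashed the engine is labelled with the signal name.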
Original comment by tho...@kuehne.cn
on 5 Sep 2013 at 1:47
Thank you, zde...@gmail.com. After I downloaded the svn version, the assert
failure disappeared. However, the APPLY BOX failures still exist. I've checked
the box file many times but couldn't find anything wrong.
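Checking a box file by eye is error-prone, so a small script can help. This is a minimal sketch, assuming the usual box file layout of one `symbol left bottom right top page` entry per line (as in the "k 674 926 691 951 0" string visible in the backtrace above); `parse_box_line` and `check_box` are illustrative names, and the checks are only examples of what could be wrong.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line; raises ValueError if it does not have six fields."""
    symbol, left, bottom, right, top, page = line.rsplit(None, 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def check_box(box: Box, width: Optional[int] = None,
              height: Optional[int] = None) -> List[str]:
    """Return a list of human-readable problems found in one box."""
    problems = []
    if box.right <= box.left:
        problems.append("zero or negative width")
    if box.top <= box.bottom:
        problems.append("zero or negative height")
    if box.left < 0 or box.bottom < 0:
        problems.append("negative coordinate")
    # The image size is optional; bounds are only checked when it is given.
    if width is not None and box.right > width:
        problems.append("right edge outside image")
    if height is not None and box.top > height:
        problems.append("top edge outside image")
    return problems
```

Looping over the lines of a box file and printing every line for which `check_box` returns a non-empty list gives a quick first pass before looking at the image itself.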
Original comment by chenjie2...@gmail.com
on 6 Sep 2013 at 12:40
@thomas@kuehne.cn:
You are not using the svn version as suggested. As you can see in issue 894 and
comment #8 here, the issue is fixed there.
Original comment by zde...@gmail.com
on 6 Sep 2013 at 8:07
@chenjie2001
1. JPEG is not a good format for OCR. Your image looks generated, so you
should definitely use a different format (PNG or TIFF, without JPEG
compression).
2. If you do training, try to use binary images rather than multicolored ones.
3. The DPI on the JPEG is 72. That is too low...
I played with your image a little (I binarized it, converted it to PNG, and
changed the DPI information to 92) and it works for me:
tesseract test.STSONG.exp1.png test.STSONG.exp1 box.train
Tesseract Open Source OCR Engine v3.02.03 with Leptonica
row xheight=24, but median xheight = 29.125
row xheight=22, but median xheight = 29.125
APPLY_BOXES:
Boxes read from boxfile: 333
Found 333 good blobs.
TRAINING ... Font name = STSONG
Generated training data for 79 words
So the conclusion is that image pre-processing seems to be key for success.
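The pre-processing steps named above (binarize, save as PNG, set the DPI information) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the image is already available as rows of 8-bit grayscale values, the fixed threshold of 128 is an arbitrary choice, and `binarize` and `write_gray_png` are hypothetical helpers. In practice an image library or an image editor would do the same job.

```python
import struct
import zlib


def binarize(pixels, threshold=128):
    """Map every grayscale value to pure black (0) or pure white (255)."""
    return [[255 if p >= threshold else 0 for p in row] for row in pixels]


def _chunk(tag, data):
    """Build one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def write_gray_png(path, pixels, dpi):
    """Write rows of 8-bit grayscale values as a PNG carrying DPI information."""
    height, width = len(pixels), len(pixels[0])
    # IHDR: width, height, bit depth 8, color type 0 (grayscale), no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # pHYs stores pixels per metre; one inch is 0.0254 m.
    ppm = round(dpi / 0.0254)
    phys = struct.pack(">IIB", ppm, ppm, 1)
    # Each scanline starts with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in pixels)
    with open(path, "wb") as out:
        out.write(b"\x89PNG\r\n\x1a\n")
        out.write(_chunk(b"IHDR", ihdr))
        out.write(_chunk(b"pHYs", phys))
        out.write(_chunk(b"IDAT", zlib.compress(raw)))
        out.write(_chunk(b"IEND", b""))
```

For example, `write_gray_png("out.png", binarize(pixels), dpi=92)` would store a two-level image with the DPI value mentioned in the comment; the file name here is a placeholder.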
Original comment by zde...@gmail.com
on 6 Sep 2013 at 8:17
Original issue reported on code.google.com by
chenjie2...@gmail.com
on 1 Sep 2013 at 12:49