patcharats / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tesseract crashes when it tries to process the attached tif file #64

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. run tesseract on the attached file
C:\tesseract\tesseract.exe test.tif test -l nld

What is the expected output? 
That a file "test.txt" is created

What do you see instead?
---------------------------
Microsoft Visual C++ Runtime Library
---------------------------
Runtime Error!

Program: c:\tesseract\tesseract.exe

abnormal program termination

---------------------------
OK   
---------------------------

What version of the product are you using? On what operating system?
Build dated 30 August 2007, on a Windows 2000 system

Please provide any additional information below.
Please try the attached file:

(Created with ImageMagick, since Tesseract didn't understand the TIFF
encoding or the multipage format. The commands used were:
c:\tesseract\ImageMagick\convert.exe ../tif/bw_advies.tif -compress None singlepage.%%d.tif
c:\tesseract\ImageMagick\montage.exe singlepage.*.tif -geometry +0+0 -tile 1 montage.tif
)

PS: The attached document is for testing purposes only; it is copyrighted.

Original issue reported on code.google.com by eywitteveen on 4 Sep 2007 at 6:15

GoogleCodeExporter commented 9 years ago

Original comment by eywitteveen on 4 Sep 2007 at 6:17

Attachments:

GoogleCodeExporter commented 9 years ago
Tested with Paintbrush: when I tried to save the file, Paintbrush crashed!
When checked with infanview, the resolution was abnormal. You could scan
the document again with a scanner and then test that output.

Original comment by withbles...@gmail.com on 4 Sep 2007 at 6:00

GoogleCodeExporter commented 9 years ago
Maybe there is another way to process compressed multipage TIFFs with
Tesseract? I prefer applications that I can run from the command line, so
I can run everything in batch.

About the bug itself:
First of all, did you mean "IrfanView" instead of "infanview"? I've installed
IrfanView 4.0.

I've opened the image with:
- IrfanView 4.00: successful (no messages whatsoever)
- Paint Version 5.0 build 2195: Service Pack 4: successful (only the background is green; I could save as BMP and TIFF, and the background in the saved TIFF is also green)
- "Imaging for Windows Preview": no problems
- Gimp 2.2.12: a warning that the resolution was meaningless
With Gimp I saved the document with compression set to "None" as test2.tif.

This new image (test2.tif) also gives me this problem:
- Crash in Tesseract
- Gimp opens test2.tif successfully
- "Imaging for Windows Preview": successful
- Paint Version 5.0 build 2195: Service Pack 4: successful (now the background is white :D)

Original comment by eywitteveen on 5 Sep 2007 at 11:20

Attachments:

GoogleCodeExporter commented 9 years ago
Tesseract uses 16 bits internally for pixel coordinates, so your image at 42900
pixels high is too big. While a fix is unlikely to be forthcoming soon, I might
make it more gracefully reject such images.
You have 3 possibilities:
1. Convert your multipage tiff to multiple single-page tiffs and process separately.
2. Change the code to cope properly with multipage tiffs and send a patch.
3. Wait for someone else to make the change. (It will happen eventually.)
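
To make the size limit concrete, here is a minimal Python sketch of the arithmetic. It assumes the coordinates are signed 16-bit values (the comment above says only "16 bits", but 42900 would fit in an unsigned 16-bit value, so signed seems to be implied). The function name and the 2480-pixel width are hypothetical and used only for illustration; this is not Tesseract's own check.

```python
# Assumption: pixel coordinates are stored as signed 16-bit integers.
INT16_MAX = 2**15 - 1  # 32767


def fits_in_int16(width: int, height: int) -> bool:
    """Return True if both dimensions are positive and fit in a signed 16-bit value."""
    return 0 < width <= INT16_MAX and 0 < height <= INT16_MAX


if __name__ == "__main__":
    # 42900 is the height mentioned in this thread; 2480 is a made-up width.
    print(fits_in_int16(2480, 42900))
```

Under that assumption, a montage of many pages stacked vertically exceeds the limit quickly, while each individual page would normally stay well below it.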

Original comment by theraysm...@gmail.com on 6 Sep 2007 at 12:07

GoogleCodeExporter commented 9 years ago
I was interested in the project and wanted to get my feet wet, so I thought
this might be an interesting and (at least conceptually) easy problem for
getting a feel for the code. The attached diff file appears to have no
significant negative impact on the tests provided (compare initial.summary
with change.summary) and, so far as I know, didn't cause Tesseract to crash
with the first test.tif provided. I say "so far as I know" because I killed
it after 8 hours of running.

A brief debugging session leads me to believe the problem is that you have
too many blobs on one image.

If that's the case, then my feeling is that I should see about adding my
current changes to the main source and going from there. I'm new to OSS and
I didn't see any instructions on where to put the changes, though, so if you
could point me in the right direction I'd appreciate it.

Original comment by ianh...@gmail.com on 20 Sep 2007 at 3:30

Attachments:

GoogleCodeExporter commented 9 years ago
I found the bug, and it works for this image, though not terribly well.

Original comment by ianh...@gmail.com on 23 Sep 2007 at 6:23

Attachments:

GoogleCodeExporter commented 9 years ago
Thank you for the great job; it looked like quite some work to find all the
references! (I haven't compiled it yet, working on windoze here.) I assume
this will be merged upstream?

We still need to convert compressed multipage TIFFs. Are people currently
interested in this functionality, or is it recommended to use ImageMagick to
do some conversion before using Tesseract?

Things I currently do with ImageMagick:
- remove the compression from the TIFF image
- break the image down from multiple pages into single pages
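
The two steps above, followed by one Tesseract run per page, could be scripted roughly as follows. This is a sketch, not a tested pipeline: it reuses the `convert ... -compress None` and `tesseract <image> <output base> -l nld` invocations shown earlier in this thread, and it assumes both executables are reachable under the names passed in. The function names and the `singlepage` prefix default are illustrative choices. The batch-file `%%d` becomes a single `%d` here because no batch escaping is involved.

```python
import subprocess
from pathlib import Path


def split_command(multipage: str, prefix: str = "singlepage",
                  convert: str = "convert") -> list[str]:
    """Build the ImageMagick command that writes one uncompressed TIFF per page."""
    return [convert, multipage, "-compress", "None", f"{prefix}.%d.tif"]


def ocr_command(page: str, lang: str = "nld",
                tesseract: str = "tesseract") -> list[str]:
    """Build the Tesseract command for one page; the output base drops '.tif'."""
    base = str(Path(page).with_suffix(""))
    return [tesseract, page, base, "-l", lang]


def ocr_multipage(multipage: str, workdir: str = ".") -> None:
    """Split a multipage TIFF, then run Tesseract on each page separately."""
    subprocess.run(split_command(multipage), cwd=workdir, check=True)
    # Note: lexicographic order, so page 10 sorts before page 2.
    for page in sorted(Path(workdir).glob("singlepage.*.tif")):
        subprocess.run(ocr_command(page.name), cwd=workdir, check=True)
```

Calling `ocr_multipage("bw_advies.tif")` would then leave one text file per page next to the single-page TIFFs. On Windows, passing the full path to ImageMagick's `convert.exe` (as in the original commands) avoids picking up the unrelated system `convert` utility.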

Original comment by eywitteveen on 24 Sep 2007 at 6:31

GoogleCodeExporter commented 9 years ago
Tesseract now (as of 3.00) supports multipage TIFFs via libtiff or Leptonica.

Original comment by theraysm...@gmail.com on 20 May 2010 at 6:57