harinath-palavalli / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tesseract output for certain page in multipage document not the same as output for that page alone #936

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Perform OCR on the attached file "eu-004.tiff" with just "tesseract 
eu-004.tiff out hocr"
2. Do the same for the PNG attached below
3. Note that the OCR output for tables 6.1 and 6.2 is missing many numbers 
in the single-page PNG. This does not appear to depend on the file format; 
it only seems to go wrong when OCR is performed on that page alone.
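
To make the comparison in the steps above less manual, here is a minimal Python sketch that runs the same command on both files and lists the tokens that appear in one hOCR page but not in the other. It assumes the hOCR output wraps each page in a `div` with class `ocr_page`; the helper names (`page_tokens`, `missing_tokens`, `run_tesseract`) are hypothetical and not part of Tesseract.

```python
import subprocess
from collections import Counter
from html.parser import HTMLParser


class PageTextParser(HTMLParser):
    """Collect the text tokens found inside each hOCR `ocr_page` div."""

    def __init__(self):
        super().__init__()
        self.pages = []   # one list of tokens per ocr_page div
        self._depth = 0   # div nesting depth inside the current page (0 = outside)

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "ocr_page" in classes:
                self.pages.append([])
                self._depth = 1
        else:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.pages[-1].extend(data.split())


def page_tokens(hocr_text):
    """Return a list of token lists, one per page in the hOCR document."""
    parser = PageTextParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.pages


def missing_tokens(reference, candidate):
    """Tokens present in `reference` but not matched in `candidate` (multiset difference)."""
    return sorted((Counter(reference) - Counter(candidate)).elements())


def run_tesseract(image_path, output_base):
    """Same invocation as in the report: tesseract <image> <output_base> hocr."""
    subprocess.run(["tesseract", image_path, output_base, "hocr"], check=True)
```

A possible use: call `run_tesseract("eu-004.tiff", "multi")` and `run_tesseract` on the PNG, read the two output files, and pass the matching page from each to `missing_tokens`. The name of the output file (for example `multi.html` or `multi.hocr`) depends on the Tesseract version, and the index of the affected page within the TIFF is not stated in this report, so both have to be filled in by hand.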

What is the expected output? What do you see instead?
The expected output is the output for tables 6.1 and 6.2 from the TIFF file, 
which is in fact almost perfect. In the OCR output from the PNG, however, 
essentially only the far-right column contains anything, and even that column 
is missing values.

What version of the product are you using? On what operating system?
I'm using version 3.02

Please provide any additional information below.
I have attached the files mentioned above. I'm using OCR to do table 
recognition; previously this hadn't been an issue because OCR was being 
performed on the whole document rather than one page at a time. Some changes 
to the application altered that, however, and this bug surfaced.

Original issue reported on code.google.com by jake.h.e...@gmail.com on 3 Jun 2013 at 9:06

GoogleCodeExporter commented 9 years ago
I realized that I forgot to attach the files I was talking about, so I have 
attached them here.

Original comment by jake.h.e...@gmail.com on 3 Jun 2013 at 9:19

Attachments: