tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

tesseract 4 --oem 0 baseline error with rotated pages #2086

Closed mhechthz closed 4 years ago

mhechthz commented 5 years ago

Hello,

I recently installed Tesseract 4.0 (tesseract-ocr-w64-setup-v4.0.0.20181030.exe) on my Win7 system. To check the page orientation I used the old OCR method, i.e. --oem 0, since it is much faster than LSTM, together with hOCR output. Using the textangle information I rotated the TIFF files where necessary, then ran LSTM OCR and produced a PDF with a text overlay.

With the new Tesseract version I never get any textangle information if the page is rotated by 180 degrees, and no text is recognized. Unfortunately the same holds for LSTM, where no text is recognized either.

Are there any changes or errors? How can I get the text orientation if the page is rotated by 180 degrees?

amitdo commented 5 years ago

Please provide:

mhechthz commented 5 years ago

tesseract.exe "image.tif" "image.tif_ocr" --oem 0 -l deu+eng hocr

By the way: using the psm option is useless because a rotation of e.g. 10 degrees (from scanning) is recognized as 0 degrees. With the last 4.0 beta version everything was OK.

amitdo commented 5 years ago

What is the output in the terminal?

Can you provide the image?

mhechthz commented 5 years ago

In the meantime I uninstalled the version I was using (https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0.20181030.exe) and went back to the last beta (https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0-beta.4.20180912.exe), which does the job. Unfortunately I no longer have the terminal output, but I do have the hOCR files from both Tesseract versions for a page rotated by 180 degrees (see attachment: the one with textangle 180 is from the beta and the one without is from the "stable" release; I also added the TIFF file as a JPEG because the TIFF is too large for your server and was rejected). Is this sufficient? The output on the shell was inconspicuous.

On Wed, 28 Nov 2018 at 19:10, Amit D. <notifications@github.com> wrote:

What is the output in the terminal?

Can you provide the image?


zdenop commented 5 years ago

Can you provide an image for testing?

mhechthz commented 5 years ago

Well, you can take any image that is rotated by 180 degrees, since it happens for any document with rotated pages.

The "wrong" hOCR file looks like this:

<span class='ocr_line' id='line_1_12' title="bbox 338 580 1021 614; baseline -0.001 -7; x_size 34.224701; x_descenders 7.7738094; x_ascenders 7.7738094">
      <span class='ocrx_word' id='word_1_34' title='bbox 338 588 423 614; x_wconf 57' lang='eng'>-IN0S</span>
      <span class='ocrx_word' id='word_1_35' title='bbox 435 588 499 614; x_wconf 92' lang='eng'>pun</span>
      <span class='ocrx_word' id='word_1_36' title='bbox 511 580 694 614; x_wconf 17' lang='eng'>-HunmMIday</span>
      <span class='ocrx_word' id='word_1_37' title='bbox 707 588 756 614; x_wconf 36' lang='eng'>SIP</span>
      <span class='ocrx_word' id='word_1_38' title='bbox 769 588 919 613; x_wconf 0' lang='eng'>9}19Sqa</span>
      <span class='ocrx_word' id='word_1_39' title='bbox 921 588 1021 606; x_wconf 53' lang='eng'>-SUSUL</span>
     </span>

What I expected was:

 <span class='ocr_line' id='line_1_4' title="bbox 1311 3113 2309 3149; textangle 180; x_size 33; x_descenders 8; x_ascenders 7">
      <span class='ocrx_word' id='word_1_11' title='bbox 2184 3118 2309 3143; x_wconf 96'>werden</span>
      <span class='ocrx_word' id='word_1_12' title='bbox 2023 3119 2171 3145; x_wconf 96'>abermals</span>
      <span class='ocrx_word' id='word_1_13' title='bbox 1894 3120 2010 3145; x_wconf 96'>kleiner</span>
      <span class='ocrx_word' id='word_1_14' title='bbox 1819 3120 1883 3145; x_wconf 96'>und</span>
      <span class='ocrx_word' id='word_1_15' title='bbox 1624 3113 1808 3146; x_wconf 87'>kompakter,</span>
      <span class='ocrx_word' id='word_1_16' title='bbox 1554 3122 1610 3147; x_wconf 96'>Die</span>
      <span class='ocrx_word' id='word_1_17' title='bbox 1380 3122 1542 3149; x_wconf 93'>Notebooks</span>
      <span class='ocrx_word' id='word_1_18' title='bbox 1311 3124 1379 3142; x_wconf 92'>er-</span>
     </span>

This seems to be independent of --oem 0 or --oem 1.

New information: this version, https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0-rc3.20181014.exe, is also unable to recognise pages rotated by 180 degrees. Up to the last beta everything worked well.
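For anyone who wants to check their own hOCR files for the same difference, here is a minimal Python sketch. It only looks at the title attribute of each ocr_line span, as in the two snippets above, and reports the textangle value, or None when the line carries only a baseline. The function names and the majority-vote rule are my own assumptions for illustration; they are not part of Tesseract.

```python
import re

# Title attribute of an hOCR ocr_line span (either quote style is accepted).
LINE_RE = re.compile(r"""class=['"]ocr_line['"][^>]*?title=(['"])(.*?)\1""")
# "textangle <number>" inside that title attribute.
ANGLE_RE = re.compile(r"textangle\s+(-?\d+(?:\.\d+)?)")


def line_angles(hocr):
    """Return one entry per ocr_line: the textangle as float, or None if absent."""
    angles = []
    for match in LINE_RE.finditer(hocr):
        angle = ANGLE_RE.search(match.group(2))
        angles.append(float(angle.group(1)) if angle else None)
    return angles


def page_is_upside_down(hocr):
    """Hypothetical rule: more than half of the lines report textangle 180."""
    angles = line_angles(hocr)
    flipped = sum(1 for a in angles if a == 180.0)
    return bool(angles) and flipped > len(angles) / 2


# Line titles copied from the two hOCR files quoted above.
BETA_LINE = ("<span class='ocr_line' id='line_1_4' title=\"bbox 1311 3113 2309 3149; "
             "textangle 180; x_size 33; x_descenders 8; x_ascenders 7\">")
STABLE_LINE = ("<span class='ocr_line' id='line_1_12' title=\"bbox 338 580 1021 614; "
               "baseline -0.001 -7; x_size 34.224701; x_descenders 7.7738094; "
               "x_ascenders 7.7738094\">")

print(line_angles(BETA_LINE), page_is_upside_down(BETA_LINE))
print(line_angles(STABLE_LINE), page_is_upside_down(STABLE_LINE))
```

In practice the whole hOCR file would be read into a string and passed to page_is_upside_down before deciding whether to rotate the TIFF.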

zdenop commented 5 years ago

Image for testing: phototest-r180

tesseract phototest-r180.png -

produces:

"Xo} Aze| sy} 1ano0 padwinl Bop umoiq
¥aInb 8y "xoj Aze| sy} son0 padwn(
Bop umoiq oinb ay) "xoy Aze| ayy Jeno
padwn( Bop umougq oInb sy xoy Aze|
8y} Jano padwinl Bop umouq yoinb sy
“Jewloy oyl Jo

sadA} |le uo s)I10m )l i 88S puE 8pod 190
3y} 1881 0} 1xa} Julod Z| 4o 10| € s siy)

amitdo commented 4 years ago

@mhechthz

You need to add --psm 1 to the command.

https://github.com/tesseract-ocr/tesseract/commit/ecfee53bac5
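
As a rough sketch of what that looks like when applied to the command posted earlier in this thread, the Python helper below builds the same invocation with --psm 1 (automatic page segmentation with OSD) added. The helper names are hypothetical, the file names are the placeholders from the original command, and it assumes that tesseract is on the PATH and that osd.traineddata is installed, since orientation detection depends on it.

```python
import subprocess


def build_command(image, outbase, psm=1, oem=0, langs="deu+eng"):
    """Build the tesseract argument list from the thread, plus --psm."""
    return [
        "tesseract", image, outbase,
        "--oem", str(oem),   # 0 = legacy engine, as in the original command
        "--psm", str(psm),   # 1 = automatic page segmentation with OSD
        "-l", langs,
        "hocr",              # config name that requests hOCR output
    ]


def run_ocr(image, outbase, **options):
    """Run tesseract and return its terminal output (stdout and stderr)."""
    result = subprocess.run(
        build_command(image, outbase, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout + result.stderr


print(" ".join(build_command("image.tif", "image.tif_ocr")))
```

Calling run_ocr("image.tif", "image.tif_ocr") would execute the command; the resulting hOCR file can then be checked for textangle values on rotated pages.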