Uzn results differ from manual pre-cropping

sweco-sekrsv commented 5 years ago

Environment

Tesseract Version: v4.0.0.20181030 (UB Mannheim Windows binary from https://github.com/UB-Mannheim/tesseract/wiki)
Leptonica: 1.75.3
Platform: Windows 10 64-bit

Current Behavior:

I'm using the best version of the English traineddata: https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddat

I'm getting different results when using a larger image together with a uzn file than when manually cropping the image to the bounds in the uzn file and running the same command. I can't see why the recognized text should differ.

Text in image: S21-3002-84B-A0000-0S-2107

UZN version - gives the wrong text: 521-3002-84B-A0000-05-2107 (the first S becomes a 5, and the second S also becomes a 5)

tesseract.exe "D:\test\S21uzn.tif" "D:\test\S21uzneng" --psm 4 --oem 1 -l eng hocr

Manually cropped version - gives the correct text: S21-3002-84B-A0000-0S-2107

tesseract.exe "D:\test\S21cropped.tif" "D:\test\S21croppedeng" --psm 4 --oem 1 -l eng hocr

Example files attached. tess_uzn_bug.zip
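
For anyone reproducing the comparison without the attachment, here is a minimal sketch of turning a uzn file into crop boxes, so the manual crop matches the zone exactly. It assumes each uzn line has the form `left top width height label` in pixels, measured from the top-left corner. `Zone` and `parse_uzn` are hypothetical helper names, and the sample coordinates are made up (they are not taken from tess_uzn_bug.zip).

```python
from typing import List, NamedTuple


class Zone(NamedTuple):
    """A crop box in pixel coordinates, origin at the top-left corner."""
    left: int
    top: int
    right: int
    bottom: int
    label: str


def parse_uzn(text: str) -> List[Zone]:
    """Parse uzn text, assuming 'left top width height label' per line."""
    zones = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        left, top, width, height = (int(p) for p in parts[:4])
        label = parts[4] if len(parts) > 4 else ""
        # Convert width/height into right/bottom edges.
        zones.append(Zone(left, top, left + width, top + height, label))
    return zones


if __name__ == "__main__":
    sample = "100 200 640 48 Text\n"  # hypothetical zone, not from the issue
    for zone in parse_uzn(sample):
        print(zone)
```

The first four fields of each `Zone` are in (left, top, right, bottom) order, which is the box order that an image library such as Pillow expects for cropping, so the cropped file and the uzn zone should cover the same pixels.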

Expected Behavior:

Uzn version to output the same recognized text as manually cropped version.

Suggested Fix:

Unknown

amitdo commented 5 years ago

Try each option separately:

without '--psm n'
--psm 6
--psm 3

sweco-sekrsv commented 5 years ago

I think uzn files are only used with --psm 4 and higher. But I tried your suggestions, and here are the results:

Without the '--psm n' option, it does not use the uzn file but runs recognition on the full image.
The same applies to '--psm 3': it does not use the uzn file but runs recognition on the full image.
Using '--psm 6' gives the same faulty text as '--psm 4': 521-3002-84B-A0000-05-2107
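
To repeat these runs consistently, the command lines can be generated from one place. This is only a sketch: it reuses the flags and paths from the commands in this issue, `build_command` is a hypothetical helper, and the `_default`/`_psmN` output suffixes are made up for illustration. It prints the commands and does not run Tesseract.

```python
from typing import List, Optional


def build_command(image: str, out_base: str, psm: Optional[int]) -> List[str]:
    """Build a tesseract argument list; psm=None omits the --psm option."""
    cmd = ["tesseract.exe", image, out_base]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Remaining options copied from the commands used in this issue.
    cmd += ["--oem", "1", "-l", "eng", "hocr"]
    return cmd


if __name__ == "__main__":
    image = r"D:\test\S21uzn.tif"
    for psm in (None, 3, 4, 6):
        suffix = "default" if psm is None else f"psm{psm}"
        out_base = rf"D:\test\S21uzneng_{suffix}"  # hypothetical output name
        print(" ".join(build_command(image, out_base, psm)))
```

Each list can be passed to `subprocess.run` as is, which avoids quoting problems with Windows paths.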
