tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Uzn results differ from manual pre-cropping #2156

Open sweco-sekrsv opened 5 years ago

sweco-sekrsv commented 5 years ago

Environment

Current Behavior:

I'm using the best version of the English traineddata: https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddat

I'm getting different results when I run Tesseract on a larger image together with a uzn file than when I manually crop the image to the bounds given in the uzn file and run the same command. I can't see why the recognized text should differ.

Text in image: S21-3002-84B-A0000-0S-2107

UZN version - gives the wrong text: 521-3002-84B-A0000-05-2107 (both the first and the second S are recognized as 5)

tesseract.exe "D:\test\S21uzn.tif" "D:\test\S21uzneng" --psm 4 --oem 1 -l eng hocr

Manually cropped version - gives the correct text: S21-3002-84B-A0000-0S-2107

tesseract.exe "D:\test\S21cropped.tif" "D:\test\S21croppedeng" --psm 4 --oem 1 -l eng hocr
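To pin down exactly where the two outputs disagree, a small character-by-character comparison can help. This is only an illustrative sketch: `char_diffs` is a hypothetical helper, and the two strings are copied from the results reported above rather than read from the hOCR output.

```python
def char_diffs(expected: str, actual: str):
    """Return (index, expected_char, actual_char) for each mismatching position.

    zip() stops at the shorter string, so a length difference is not reported.
    """
    return [(i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


# Strings taken from the two runs described in this report.
cropped = "S21-3002-84B-A0000-0S-2107"
uzn = "521-3002-84B-A0000-05-2107"

print(char_diffs(cropped, uzn))  # [(0, 'S', '5'), (20, 'S', '5')]
```

The only mismatches are the two S characters, so the rest of the line is recognized identically in both runs.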

Example files attached. tess_uzn_bug.zip

Expected Behavior:

The UZN version should output the same recognized text as the manually cropped version.
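For anyone reproducing the comparison, the crop box can be derived from the uzn file itself, so the manual crop matches the zone exactly. The sketch below assumes the usual uzn layout of one zone per line, `left top width height label`, with pixel coordinates measured from the top-left corner. The `Zone` and `parse_uzn` names are hypothetical, and the sample values are made up (the real ones are in the attached archive).

```python
from typing import List, NamedTuple, Tuple


class Zone(NamedTuple):
    left: int
    top: int
    width: int
    height: int
    label: str

    def crop_box(self) -> Tuple[int, int, int, int]:
        """Return (left, upper, right, lower), the box form many image libraries expect."""
        return (self.left, self.top, self.left + self.width, self.top + self.height)


def parse_uzn(text: str) -> List[Zone]:
    """Parse uzn text, assuming 'left top width height [label]' on each line."""
    zones = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        left, top, width, height = (int(v) for v in fields[:4])
        label = " ".join(fields[4:])  # empty string when no label is given
        zones.append(Zone(left, top, width, height, label))
    return zones


# Hypothetical zone values, for illustration only.
sample = "120 340 600 48 Text\n"
for zone in parse_uzn(sample):
    print(zone.label, zone.crop_box())  # Text (120, 340, 720, 388)
```

Cropping the full image to the printed box with any image tool and running the same tesseract command on the result should give a like-for-like comparison with the UZN run.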

Suggested Fix:

Unknown

amitdo commented 5 years ago

Try each option separately:

sweco-sekrsv commented 5 years ago

I think uzn files only work with --psm 4 and higher. But I tried your suggestions, and here are the results: