madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

image_to_boxes not showing right horizontal coordinates for textangle 90. #386

Closed: rmast closed this issue 3 years ago

rmast commented 3 years ago

Get bounding box estimates

print(pytesseract.image_to_boxes(Image.open('/home/robert/Afbeeldingen/scantailorin/210913 nog 2-000na.tif'), lang='nld'))

For some reason, the horizontal coordinates of the vertical text on the left side of the attached image are all 0, even though that text sits to the right of a white margin and the characters have nonzero width.

2 1968 0 1982 0 0
0 1985 0 1998 0 0
8 2001 0 2014 0 0
4 2016 0 2030 0 0
- 2041 0 2049 0 0
2 2059 0 2073 0 0
/ 2074 0 2082 0 0
2 2083 0 2097 0 0
2 2116 0 2130 0 0
5 2133 0 2146 0 0
1 2150 0 2158 0 0
9 2165 0 2179 0 0
8 2181 0 2195 0 0
0 2197 0 2211 0 0
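To make the problem easier to spot programmatically, here is a minimal sketch that parses box output like the lines above and flags entries whose extent is zero in one direction. It assumes the standard Tesseract box column order (`char left bottom right top page`, origin at the bottom-left of the image); the names `Box`, `parse_boxes`, and `degenerate_boxes` are hypothetical helpers, not part of pytesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    # Assumed column order of the Tesseract box format:
    # char left bottom right top page
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(text: str) -> List[Box]:
    """Parse the string returned by image_to_boxes into Box records."""
    boxes = []
    for line in text.splitlines():
        # Split from the right so the character column is kept intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, *numbers = parts
        boxes.append(Box(char, *(int(n) for n in numbers)))
    return boxes


def degenerate_boxes(boxes: List[Box]) -> List[Box]:
    """Return boxes with no width or no height."""
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]


# A few lines copied from the output in this issue.
sample = """2 1968 0 1982 0 0
0 1985 0 1998 0 0
- 2041 0 2049 0 0"""

boxes = parse_boxes(sample)
flat = degenerate_boxes(boxes)
print(f"{len(flat)} of {len(boxes)} boxes have zero extent in one direction")
```

On the sample lines every box is flagged, because the third and fifth columns are both 0.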

The hOCR output for this region does contain the correct horizontal coordinates, but only for whole words, not for individual characters:

<span class='ocr_line' id='line_1_1' title="bbox 111 1289 133 1532; textangle 90; x_size 28.416666; x_descenders 7.1041665; x_ascenders 7.1041665">
 <span class='ocrx_word' id='word_1_1' title='bbox 112 1470 133 1532; x_wconf 88'>2084</span>
 <span class='ocrx_word' id='word_1_2' title='bbox 124 1451 127 1459; x_wconf 88'>-</span>
 <span class='ocrx_word' id='word_1_3' title='bbox 111 1403 133 1441; x_wconf 96'>2/2</span>
 <span class='ocrx_word' id='word_1_4' title='bbox 112 1289 133 1384; x_wconf 96'>251980</span>
</span>
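As a cross-check, the word-level boxes can be pulled out of hOCR with a small regular expression. This is only a sketch: it assumes the `ocrx_word` spans look like the ones in this issue (a `title` attribute starting with `bbox x0 y0 x1 y1`, origin at the top-left), and `word_boxes` is a hypothetical helper name. A real HTML parser would be more robust for arbitrary hOCR.

```python
import re

# Matches an ocrx_word span and captures its bbox (x0 y0 x1 y1) and its text.
WORD_RE = re.compile(
    r"class=['\"]ocrx_word['\"][^>]*?"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"]>"
    r"([^<]*)<"
)


def word_boxes(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) for each ocrx_word span."""
    return [
        (m.group(5), tuple(int(m.group(i)) for i in range(1, 5)))
        for m in WORD_RE.finditer(hocr)
    ]


# Two word spans copied from the hOCR in this issue.
sample = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 112 1470 133 1532; x_wconf 88'>2084</span>\n"
    "<span class='ocrx_word' id='word_1_3' "
    "title='bbox 111 1403 133 1441; x_wconf 96'>2/2</span>"
)

words = word_boxes(sample)
for text, (x0, y0, x1, y1) in words:
    print(f"{text!r}: width={x1 - x0}, height={y1 - y0}")
```

Unlike the box output, each word here has a nonzero width (`x1 - x0`), which is what makes the zeros in the box output look wrong.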

I tested this with tesseract 4.1.1 and 5.0.0-beta-20210916, using the nld (and eng) language data from https://github.com/tesseract-ocr/tessdata. The files have these sizes:

15400601 okt 3 11:16 eng.traineddata
8903736 okt 3 11:16 nld.traineddata

This is the test image: 210913 nog 2-000na.zip

bozhodimitrov commented 3 years ago

Hi @rmast, do you get the same result with tesseract directly?

rmast commented 3 years ago

How should I invoke tesseract directly to get this list?

rmast commented 3 years ago

I found it: append makebox to the CLI command. Thanks, I will file an issue with tesseract.
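For anyone reproducing this without pytesseract, the direct invocation can be sketched as follows. It assumes the usual `tesseract <image> <outputbase> -l <lang> makebox` form, which writes `<outputbase>.box`; the function names and the `page.tif` / `page` paths are placeholders, not the files from this issue.

```python
import shutil
import subprocess
from pathlib import Path


def makebox_command(image, output_base, lang="nld"):
    """Build the tesseract command line that writes <output_base>.box."""
    return ["tesseract", str(image), str(output_base), "-l", lang, "makebox"]


def run_makebox(image, output_base, lang="nld"):
    """Run tesseract with the makebox config and return the box file text.

    Returns None when the tesseract binary is not on PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(makebox_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".box").read_text(encoding="utf-8")


# Placeholder paths; substitute the real image and output base.
cmd = makebox_command("page.tif", "page")
print(" ".join(cmd))
```

If the zeros also appear in the `.box` file produced this way, the problem lies in tesseract rather than in the wrapper.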