Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('/home/robert/Afbeeldingen/scantailorin/210913 nog 2-000na.tif'), lang='nld'))
The horizontal coordinates of the vertical text on the left in the attached image are for some reason all 0, despite the white margin on the left and the width of the characters.
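To make the symptom easy to check, here is a minimal sketch that parses the text returned by `image_to_boxes` and lists the rows whose horizontal coordinates are both 0. It assumes the usual box-file layout of one `char left bottom right top page` row per line, with the origin at the bottom-left of the image. The helper names and the two sample rows are made up for illustration; they are not the real output for the attached image.

```python
def parse_boxes(box_text):
    """Parse box-file text into (char, left, bottom, right, top) tuples."""
    rows = []
    for line in box_text.splitlines():
        # Split from the right so the character itself is kept intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip empty or malformed lines
        char, left, bottom, right, top, _page = parts
        rows.append((char, int(left), int(bottom), int(right), int(top)))
    return rows


def zero_x_boxes(rows):
    """Return the rows whose left and right coordinates are both 0."""
    return [row for row in rows if row[1] == 0 and row[3] == 0]


# Hypothetical sample rows, NOT taken from the attached image.
sample = "2 0 1470 0 1490 0\nA 300 200 320 230 0"
rows = parse_boxes(sample)
suspicious = zero_x_boxes(rows)
print(suspicious)
```

In real use, `sample` would be replaced by the string returned from the `image_to_boxes` call above.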
The hocr of this part does contain the correct horizontal coordinates, but only for full words, not for characters:
<span class='ocr_line' id='line_1_1' title="bbox 111 1289 133 1532; textangle 90; x_size 28.416666; x_descenders 7.1041665; x_ascenders 7.1041665">
 <span class='ocrx_word' id='word_1_1' title='bbox 112 1470 133 1532; x_wconf 88'>2084</span>
 <span class='ocrx_word' id='word_1_2' title='bbox 124 1451 127 1459; x_wconf 88'>-</span>
 <span class='ocrx_word' id='word_1_3' title='bbox 111 1403 133 1441; x_wconf 96'>2/2</span>
 <span class='ocrx_word' id='word_1_4' title='bbox 112 1289 133 1384; x_wconf 96'>251980</span>
 </span>
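For comparison with the box output, the word-level coordinates can be pulled out of the hOCR with a small regular expression. This is only a sketch: the pattern and the `word_boxes` helper are my own, it handles just the `ocrx_word` spans in the shape shown above, and a proper HTML parser would be more robust. Note that hOCR `bbox` values are `x0 y0 x1 y1` measured from the top-left corner, whereas box-file rows are measured from the bottom-left, so the vertical values need flipping before the two can be compared directly.

```python
import re

# Matches one ocrx_word span and captures its four bbox numbers and its text.
WORD_RE = re.compile(
    r"""<span class=['"]ocrx_word['"][^>]*?title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^>]*>(.*?)</span>"""
)


def word_boxes(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) for each ocrx_word span."""
    result = []
    for x0, y0, x1, y1, text in WORD_RE.findall(hocr):
        result.append((text, (int(x0), int(y0), int(x1), int(y1))))
    return result


# Two of the word spans from the hOCR output quoted above.
hocr = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 112 1470 133 1532; x_wconf 88'>2084</span>\n"
    "<span class='ocrx_word' id='word_1_4' "
    "title='bbox 112 1289 133 1384; x_wconf 96'>251980</span>\n"
)
words = word_boxes(hocr)
for text, bbox in words:
    print(text, bbox)
```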
I tested it against tesseract 4.1.1 and 5.0.0-beta-20210916 and the language nld (and eng) from https://github.com/tesseract-ocr/tessdata with these sizes:
15400601 okt 3 11:16 eng.traineddata
8903736 okt 3 11:16 nld.traineddata
This is the test-image: 210913 nog 2-000na.zip