rmast commented 3 years ago

# Get bounding box estimates
import pytesseract
from PIL import Image

print(pytesseract.image_to_boxes(Image.open('/home/robert/Afbeeldingen/scantailorin/210913 nog 2-000na.tif'), lang='nld'))

For the vertical text on the left side of the attached image, the horizontal coordinates are all reported as 0, even though there is a white margin on the left and the characters have a nonzero width.

2 1968 0 1982 0 0
0 1985 0 1998 0 0
8 2001 0 2014 0 0
4 2016 0 2030 0 0
- 2041 0 2049 0 0
2 2059 0 2073 0 0
/ 2074 0 2082 0 0
2 2083 0 2097 0 0
2 2116 0 2130 0 0
5 2133 0 2146 0 0
1 2150 0 2158 0 0
9 2165 0 2179 0 0
8 2181 0 2195 0 0
0 2197 0 2211 0 0
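To make the collapsed coordinates easier to spot, here is a minimal sketch that parses this output, assuming the usual Tesseract box-file field order (symbol, left, bottom, right, top, page). The names `CharBox`, `parse_boxes` and `extent` are hypothetical helpers for illustration and are not part of pytesseract.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    """One line of box output, assuming: symbol left bottom right top page."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(text: str) -> List[CharBox]:
    """Parse box-format text into CharBox records, skipping malformed lines."""
    boxes = []
    for line in text.splitlines():
        # Split from the right so the five numeric fields are always last.
        parts = line.rsplit(maxsplit=5)
        if len(parts) != 6:
            continue
        char, *numbers = parts
        try:
            left, bottom, right, top, page = (int(n) for n in numbers)
        except ValueError:
            continue
        boxes.append(CharBox(char, left, bottom, right, top, page))
    return boxes


def extent(box: CharBox) -> Tuple[int, int]:
    """Return (right - left, top - bottom) for a box."""
    return box.right - box.left, box.top - box.bottom


# The first lines of the output quoted above.
sample = "2 1968 0 1982 0 0\n0 1985 0 1998 0 0\n8 2001 0 2014 0 0"
for box in parse_boxes(sample):
    print(box.char, extent(box))
```

Every box in the sample has an extent of zero along one axis, which is the symptom described here.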

The hOCR output for this region does contain the correct horizontal coordinates, but only for whole words, not for individual characters:

2084 - 2/2 251980
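For comparison with the character boxes, the word-level boxes can be pulled out of hOCR with a short parser. This is only a sketch: it assumes words are `ocrx_word` elements whose `title` attribute carries `bbox x0 y0 x1 y1`, as in typical hOCR output. The class name `WordBoxParser` is hypothetical, and the coordinates in the sample markup are made up for illustration, not taken from the test image.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._pending_bbox = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("class") == "ocrx_word":
            match = BBOX_RE.search(attributes.get("title") or "")
            self._pending_bbox = (
                tuple(int(v) for v in match.groups()) if match else None
            )

    def handle_data(self, data):
        # The first non-empty text after an ocrx_word start tag is its word.
        text = data.strip()
        if self._pending_bbox is not None and text:
            self.words.append((text, self._pending_bbox))
            self._pending_bbox = None


# Made-up sample markup; real output would come from the hOCR file.
sample_hocr = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 10 20 30 40; x_wconf 96'>2084</span>"
)
parser = WordBoxParser()
parser.feed(sample_hocr)
print(parser.words)
```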

I tested this with Tesseract 4.1.1 and 5.0.0-beta-20210916, using the nld (and eng) language data from https://github.com/tesseract-ocr/tessdata, with these file sizes:

15400601 okt 3 11:16 eng.traineddata
8903736 okt 3 11:16 nld.traineddata

This is the test image: 210913 nog 2-000na.zip

bozhodimitrov commented 3 years ago

Hi @rmast, do you get the same result with tesseract directly?

rmast commented 3 years ago

How should I invoke tesseract to get this list?

rmast commented 3 years ago

I found it: append makebox to the CLI command. Thanks, I will file an issue with tesseract.
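For anyone who wants to reproduce this outside pytesseract, here is a rough sketch of that invocation from Python. It assumes the `tesseract` binary is on the PATH and that the installed version writes box output to standard output when the output base is `stdout`. The function names and the `scan.tif` path are placeholders, not part of any library.

```python
import shutil
import subprocess
from typing import List


def makebox_command(image_path: str, lang: str = "nld",
                    tesseract_bin: str = "tesseract") -> List[str]:
    """Build the CLI call: tesseract <image> stdout -l <lang> makebox."""
    return [tesseract_bin, image_path, "stdout", "-l", lang, "makebox"]


def run_makebox(image_path: str, lang: str = "nld") -> str:
    """Run tesseract with the makebox config and return its box output."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        makebox_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


# 'scan.tif' is a placeholder path; printing the command needs no binary.
print(" ".join(makebox_command("scan.tif")))
```

If the direct output shows the same zeros as `image_to_boxes`, the problem is in tesseract itself rather than in the wrapper.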

madmaze / pytesseract

image_to_boxes not showing right horizontal coordinates for textangle 90. #386
