Closed: tanjunyao7 closed this issue 2 years ago
This issue was reported many times before.
See bounding box.
So which discussion gives a solution?
By the way, the legacy engine produces a similar result: the boxes still overlap.
You'll have to accept the fact that Tesseract does not do a good job on bounding box detection (although it usually does a better job with the legacy engine, for English at least).
OK. I'll detect the bounding boxes myself. Thanks anyway.
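For anyone taking the same route, here is a minimal sketch of one way to find boxes without Tesseract: a vertical projection profile over a single text line that has already been binarized. This is not from the thread. The function name `projection_boxes` and the `min_gap` parameter are made up for illustration, and the input is assumed to be a list of rows with 1 for ink and 0 for background (for example, the result of thresholding the image with OpenCV). Characters built from side-by-side parts with blank columns between them will be split unless `min_gap` is raised.

```python
def projection_boxes(grid, min_gap=1):
    """Return tight (x, y, w, h) boxes for ink runs in a binarized text line.

    grid: list of equal-length rows, 1 = ink, 0 = background (assumed input).
    min_gap: hypothetical tuning knob; blank-column gaps narrower than this
             do not split a box.
    """
    if not grid or not grid[0]:
        return []
    width = len(grid[0])
    # Column profile: True where the column contains any ink.
    has_ink = [any(row[x] for row in grid) for x in range(width)]

    # Collect [start, end) column runs that contain ink.
    runs = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            runs.append([start, x])
            start = None
    if start is not None:
        runs.append([start, width])

    # Merge runs separated by fewer than min_gap blank columns.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    boxes = []
    for x0, x1 in merged:
        # Tighten vertically: keep only rows with ink inside this column span.
        rows = [y for y, row in enumerate(grid) if any(row[x0:x1])]
        boxes.append((x0, rows[0], x1 - x0, rows[-1] - rows[0] + 1))
    return boxes
```

The boxes use the same (left, top, width, height) convention as the script in the report, so they can be drawn with the same `cv2.rectangle` call. Splitting on blank columns is only one possible heuristic; connected components or a fixed character pitch may suit other layouts better.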
Tesseract Version: tesseract 5.0.0-beta-20210916-12-g19cc9
Commit Number: installed following https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel
Platform: Linux xxx-computer 5.11.0-27-generic #29~20.04.1-Ubuntu
Current Behavior:
input:
output:
language_model: https://github.com/tesseract-ocr/tessdata/blob/main/chi_sim.traineddata
However, the recognized characters are correct.
code:
import pytesseract
import cv2
import pandas

# Tesseract options; the tessdata path is specific to the reporter's machine.
custom_oem_psm_config = r'--tessdata-dir /home/xxx/tesseract --oem 3'
img = cv2.imread(imagefilename)  # imagefilename: path to the input image
d = pytesseract.image_to_data(img, config=custom_oem_psm_config, lang="chi_sim",
                              output_type=pytesseract.Output.DATAFRAME)
n_boxes = len(d['level'])
for i in range(n_boxes):
    # Skip structural rows (conf is -1) and rows whose text is missing (NaN).
    if d['conf'][i] > 0 and pandas.notna(d['text'][i]):
        (x, y, w, h) = (int(d['left'][i]), int(d['top'][i]),
                        int(d['width'][i]), int(d['height'][i]))
        print(d['text'][i])
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imshow('img', img)
cv2.waitKey(0)
Expected Behavior:
Bounding boxes should tightly enclose their characters, without overlapping neighboring ones.
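To put a number on how far a result is from this expectation, one option is to list the pairs of boxes that overlap. The sketch below is only an illustration and not part of the report: `overlap_area` and `overlapping_pairs` are hypothetical helper names, and boxes are assumed to be (left, top, width, height) tuples like the ones the script draws.

```python
from itertools import combinations


def overlap_area(a, b):
    """Area shared by two (x, y, w, h) boxes; 0 if they do not overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = min(ax + aw, bx + bw) - max(ax, bx)
    dy = min(ay + ah, by + bh) - max(ay, by)
    return dx * dy if dx > 0 and dy > 0 else 0


def overlapping_pairs(boxes):
    """Return (i, j, area) for every pair of boxes that share interior area."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        area = overlap_area(a, b)
        if area > 0:
            pairs.append((i, j, area))
    return pairs
```

Boxes that only touch along an edge count as non-overlapping here. An empty result means no two boxes intersect, which is necessary but not sufficient for the boxes to fit their characters tightly.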
Suggested Fix:
no idea