tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

BUG: inaccurate bounding boxes #3600

Closed tanjunyao7 closed 2 years ago

tanjunyao7 commented 2 years ago

Tesseract Version: tesseract 5.0.0-beta-20210916-12-g19cc9
Commit Number: installed following https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel
Platform: Linux xxx-computer 5.11.0-27-generic #29~20.04.1-Ubuntu

Current Behavior:

input: Screenshot from 2021-10-19 12-15-37

output: Screenshot from 2021-10-19 12-14-29

language_model: https://github.com/tesseract-ocr/tessdata/blob/main/chi_sim.traineddata

However, the recognized characters are correct.

code:

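import cv2
import pandas
import pytesseract

# The original post cuts this string off after '--oem 3'; only the closing quote is added.
custom_oem_psm_config = r'--tessdata-dir /home/xxx/tesseract --oem 3'

img = cv2.imread(imagefilename)
d = pytesseract.image_to_data(img, config=custom_oem_psm_config, lang="chi_sim",
                              output_type=pytesseract.Output.DATAFRAME)
n_boxes = len(d['level'])

for i in range(n_boxes):
    # With DATAFRAME output, missing text is a float NaN, not the string 'NaN'.
    if d['conf'][i] > 0 and pandas.notna(d['text'][i]):
        # Cast to plain ints so OpenCV accepts the coordinates.
        x, y, w, h = (int(d[k][i]) for k in ('left', 'top', 'width', 'height'))
        print(d['text'][i])
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

# Show the annotated image once all boxes are drawn.
cv2.imshow('img', img)
cv2.waitKey(0)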
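A note for readers following along: `image_to_data` reports boxes down to the word level only, so the snippet above never sees per-character rectangles. `pytesseract.image_to_boxes` returns character-level boxes as text in Tesseract's box-file layout (`char left bottom right top page`, with the origin at the bottom-left corner). The sketch below converts that text into the top-left `(x, y, w, h)` form that `cv2.rectangle` expects. The helper name `parse_char_boxes` is made up for illustration, and because these boxes come from the same engine, they may be no more accurate than the ones shown in the screenshots.

```python
def parse_char_boxes(box_text, image_height):
    """Convert Tesseract box-file text to (char, x, y, w, h) tuples.

    Box-file lines look like "char left bottom right top page" and use a
    bottom-left origin; the result uses a top-left origin instead.
    """
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the character field is kept intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        left, bottom, right, top = int(left), int(bottom), int(right), int(top)
        # Flip the vertical axis: the box's upper edge becomes y.
        boxes.append((char, left, image_height - top, right - left, top - bottom))
    return boxes


sample = "中 10 20 30 45 0\n文 32 20 52 44 0\n"
for char, x, y, w, h in parse_char_boxes(sample, image_height=100):
    print(char, x, y, w, h)
```

With the code from the post, the input would presumably be obtained as `pytesseract.image_to_boxes(img, lang="chi_sim")` and `image_height` as `img.shape[0]`.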

Expected Behavior:

Bounding boxes should tightly enclose their characters.

Suggested Fix:

no idea

amitdo commented 2 years ago

This issue was reported many times before.

See bounding box.

tanjunyao7 commented 2 years ago

> This issue was reported many times before.
>
> See bounding box.

So which discussion gives a solution?

tanjunyao7 commented 2 years ago

Btw, the legacy engine produces a similar result; the boxes still overlap.

amitdo commented 2 years ago

You'll have to accept the fact that Tesseract does not do a good job on bounding box detection (although it usually does a better job with the legacy engine, for English at least).

tanjunyao7 commented 2 years ago

> You'll have to accept the fact that Tesseract does not do a good job on bounding box detection (although it usually does a better job with the legacy engine, for English at least).

OK. I'll detect bounding boxes myself. Thanks anyway.
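The thread does not say how the boxes were eventually detected. One possible approach, offered only as a rough sketch and not as the poster's method, is to keep Tesseract's boxes as a starting point and shrink each one to the ink it actually contains. The helper `tighten_box` below is hypothetical. It assumes a binary mask in which text pixels are truthy, and it will still include a neighbour's strokes when the original boxes overlap.

```python
def tighten_box(mask, left, top, width, height):
    """Shrink a box to the smallest rectangle containing all ink inside it.

    mask: 2D sequence of rows, truthy where a pixel belongs to text.
    Returns (left, top, width, height), or None if the box holds no ink.
    """
    n_rows = len(mask)
    n_cols = len(mask[0]) if n_rows else 0
    # Clip the search window to the image bounds.
    rows = range(max(top, 0), min(top + height, n_rows))
    cols = range(max(left, 0), min(left + width, n_cols))
    ink_rows = [y for y in rows if any(mask[y][x] for x in cols)]
    ink_cols = [x for x in cols if any(mask[y][x] for y in rows)]
    if not ink_rows or not ink_cols:
        return None
    return (ink_cols[0], ink_rows[0],
            ink_cols[-1] - ink_cols[0] + 1,
            ink_rows[-1] - ink_rows[0] + 1)


# Tiny 6x8 mask with a 2x3 blob of ink at rows 2-3, columns 3-5.
mask = [[0] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4, 5):
        mask[y][x] = 1

print(tighten_box(mask, 1, 0, 7, 6))
```

For dark text on a light background, a suitable mask could be produced with something like `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]` on a grayscale image. The pure-Python loops are slow on full-size images and are meant to show the idea only.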