sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Tesserocr does not read UZN files #305

Closed DevKretov closed 2 years ago

DevKretov commented 2 years ago

Hello,

when I specify regions of interest via a .uzn file (zones file), tesserocr ignores the file, even though I set it up according to this tutorial.

The code I use:

import os

from tesserocr import PyTessBaseAPI

image_save_path = 'some/path/to/jpg/file.jpg'
# the uzn file is at 'some/path/to/jpg/file.uzn'

_tesseract_api = PyTessBaseAPI(
    lang='ces',
    psm=4,
    oem=1,
    path=os.getenv('TESSDATA_PREFIX')
)
_tesseract_api.ReadConfigFile("tsv")
_tesseract_api.ReadConfigFile("logfile")
_tesseract_api.SetImageFile(image_save_path)
_tesseract_api.Recognize()

_tesseract_api.GetUTF8Text()

The code returns the text of the whole page, not just the zones specified in the UZN file.

Is it a bug or am I doing something wrong? Thanks!

zdenop commented 2 years ago

First of all: why do you want to use a uzn file when you can use API/SetRectangle? The uzn file is meant for users of the tesseract executable... Next: see https://github.com/tesseract-ocr/tesseract/issues/3837
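
For readers following along, the SetRectangle route suggested above could look roughly like the sketch below. It assumes the zones are already known as (left, top, width, height) pixel boxes; the helper names `ocr_zones` and `run_example` and the coordinates are hypothetical and not part of tesserocr, and the choice of `PSM.SINGLE_BLOCK` is only an illustration.

```python
from typing import Iterable, List, Tuple

Zone = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def ocr_zones(api, zones: Iterable[Zone]) -> List[str]:
    """Recognize each zone separately on an image already loaded into `api`."""
    texts = []
    for left, top, width, height in zones:
        # SetRectangle restricts recognition to this sub-rectangle of the image.
        api.SetRectangle(left, top, width, height)
        texts.append(api.GetUTF8Text())
    return texts


def run_example() -> None:
    # Imported here so ocr_zones can be exercised without tesserocr installed.
    from tesserocr import PSM, PyTessBaseAPI

    image_path = 'some/path/to/jpg/file.jpg'
    # Hypothetical coordinates, for illustration only.
    zones = [(50, 40, 600, 120), (50, 300, 600, 200)]

    with PyTessBaseAPI(lang='ces', psm=PSM.SINGLE_BLOCK) as api:
        api.SetImageFile(image_path)
        for text in ocr_zones(api, zones):
            print(text)
```

Calling `run_example()` needs tesserocr, the `ces` traineddata, and a real image; `ocr_zones` itself only relies on the `SetRectangle` and `GetUTF8Text` methods of whatever object it is given.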

DevKretov commented 2 years ago

I want to use a UZN file to bypass Tesseract's internal segmentation, which I cannot control and which fails on my documents: it does not find all text regions when the text is sparsely distributed on a page.

In the end, I got the UZN file to work by using API/ProcessPage, passing the path to the image as the filename parameter, with the UZN file located alongside the image.
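
For anyone preparing the zones file programmatically, here is a minimal sketch of writing it next to the image. It assumes the usual UZN layout of one zone per line, `left top width height label`, with a label that contains no whitespace; the helper name `write_uzn` and the example coordinates are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Tuple

Zone = Tuple[int, int, int, int, str]  # (left, top, width, height, label)


def write_uzn(image_path: str, zones: Iterable[Zone]) -> Path:
    """Write zones to '<image stem>.uzn' in the same directory as the image."""
    # 'some/path/file.jpg' becomes 'some/path/file.uzn'.
    uzn_path = Path(image_path).with_suffix('.uzn')
    lines = [
        f'{left} {top} {width} {height} {label}'
        for left, top, width, height, label in zones
    ]
    uzn_path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return uzn_path
```

For example, `write_uzn('some/path/to/jpg/file.jpg', [(50, 40, 600, 120, 'Text')])` would create `some/path/to/jpg/file.uzn`, which is the location described in the comment above for the ProcessPage approach.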