Problem with GetRegions

marcotn commented 4 years ago

I am running a piece of code like this, with version 4.1.1:

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM

api = PyTessBaseAPI(psm=PSM.AUTO_OSD)
api_region = PyTessBaseAPI(psm=PSM.AUTO_OSD)

def image_ocr_boxes(img):
    api.Clear()
    print(api.Version())
    image_ram = Image.open(img)
    api.SetImage(image_ram)
    api.Recognize()
    # Each region is a tuple: (PIL image, bounding box)
    for counter, region in enumerate(api.GetRegions()):
        api_region.SetImage(region[0])
        api_region.Recognize()
        text = api_region.GetUTF8Text()
        # The boxes/ directory must already exist
        region[0].save(f"boxes/img_{counter}.jpg")
        api_region.Clear()

I wrote this to save an image of each region, to understand why the text contained in a region was "kinda cropped".

Saving an image from each region with region[0].save(), I see that the saved images really are cropped, or at least they look much smaller than the box I find in the region tuple.

I have a feeling that there is a problem with the coordinates: in one case they are stored as (x, y, w, h), but Image expects something different.
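
To make the two conventions concrete: tesserocr documents the region bounding box as a dict with x, y, w and h keys, while Pillow's Image.crop takes a (left, upper, right, lower) box. Below is a minimal sketch of the conversion. The helper name xywh_to_pil_box and the sample box are made up for illustration, and this only shows the difference in conventions; it is not a confirmed diagnosis of the cropping.

```python
def xywh_to_pil_box(bbox):
    """Convert an {'x', 'y', 'w', 'h'} box to Pillow's (left, upper, right, lower)."""
    left = bbox["x"]
    upper = bbox["y"]
    right = left + bbox["w"]    # right edge = left edge + width
    lower = upper + bbox["h"]   # lower edge = upper edge + height
    return (left, upper, right, lower)


# Hypothetical region: 200 px wide, 50 px high, top-left corner at (100, 40).
region_box = {"x": 100, "y": 40, "w": 200, "h": 50}
pil_box = xywh_to_pil_box(region_box)
print(pil_box)  # (100, 40, 300, 90)
```

Passing (x, y, w, h) straight to Image.crop would treat the width and height as the right and lower edges, which produces a smaller (or empty) crop whenever x or y is non-zero.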

Is anybody else having the same problem?

sirfz commented 4 years ago

You're better off posting this on StackOverflow to get help with the tesseract API or its behavior. I'll keep the issue open for the time being for visibility.

bertsky commented 3 years ago

api = PyTessBaseAPI(psm=PSM.AUTO_OSD)
api_region = PyTessBaseAPI(psm=PSM.AUTO_OSD)

At this point you have initialized two independent instances of Tesseract, which both loaded the default lang='eng' LSTM model. (At least one model is needed, even for segmentation.)

image_ram = Image.open(img)
api.SetImage(image_ram)

If you have image files anyway, you can skip the Pillow step and just use api.SetImageFile directly (which is based on Leptonica's own pix image format).
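
A minimal sketch of that shortcut, assuming the image is a readable file and the default eng model is installed. The helper name ocr_image_file is hypothetical, and the tesserocr import sits inside the function only so that the snippet can be loaded without the library present.

```python
def ocr_image_file(path, lang="eng"):
    """Run OCR on an image file without opening it through Pillow first."""
    # Imported here so that defining the function does not require tesserocr.
    from tesserocr import PyTessBaseAPI, PSM

    # The context manager releases the Tesseract instance when the block exits.
    # PSM.AUTO_OSD mirrors the setting used in this thread.
    with PyTessBaseAPI(lang=lang, psm=PSM.AUTO_OSD) as api:
        api.SetImageFile(path)  # the file is read directly, no PIL.Image needed
        return api.GetUTF8Text()


# Example call; "page.png" is a placeholder path:
# text = ocr_image_file("page.png")
```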

for region in api.GetRegions():
    api_region.SetImage(region[0])

That's quite a unique pattern you have invented here! So you make the api Tesseract instance give you PIL.Image / bbox tuples, the former of which you then pass on to the api_region Tesseract instance for recognition.

I don't fully grasp why you came up with that, but there are a couple of issues here:

It's inefficient; the first instance already has the results you are looking for in its internal state. Just query api.GetIterator() for block-level results!
It may yield suboptimal results. The second instance only gets to see the block images, which means any components that belong to neighbouring regions but are encompassed by the same bounding box now get re-interpreted at a lower level by the second instance: they may or may not be identified as belonging to that region. In the worst case, because non-rectangular regions necessarily overlap, text may be repeated across regions and chopped to pieces. It won't be pretty.
It will try to get text results out of non-text blocks, because GetRegions is nothing but a wrapper to GetComponentImages with text_only=false.
The second instance's PSM is set to AUTO_OSD, which is clearly wrong in your case, because you are already passing it block-level images. So you should at least use PSM.SINGLE_BLOCK.
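
The single-instance approach from the first point could look roughly like the sketch below. It assumes api is a PyTessBaseAPI on which SetImage (or SetImageFile) and Recognize() have already been called; block_level_text is a hypothetical helper name, and the handling of text-less blocks is an assumption about how the iterator reports them.

```python
def block_level_text(api):
    """Collect (bounding_box, text) pairs for each block of a recognized page."""
    # Imported here so that defining the function does not require tesserocr.
    from tesserocr import RIL, iterate_level

    results = []
    iterator = api.GetIterator()
    if iterator is None:
        # Nothing was recognized, so there is nothing to iterate over.
        return results

    for block in iterate_level(iterator, RIL.BLOCK):
        try:
            text = block.GetUTF8Text(RIL.BLOCK)
        except RuntimeError:
            # Assumption: blocks without text (e.g. images) raise here; skip them.
            continue
        # BoundingBox gives (left, top, right, bottom) in page coordinates.
        results.append((block.BoundingBox(RIL.BLOCK), text))
    return results
```

Because the text comes from the instance that segmented the whole page, no block image is cropped and re-recognized, so the overlap problem described above does not arise.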

sirfz / tesserocr

Problem with GetRegions #228