madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

`image_to_osd` with `PIL.Image` argument raises `TesseractError` for tesseract 5.0.1 #416

Closed. caerulescens closed this issue 2 years ago

caerulescens commented 2 years ago

description

pytesseract raises a TesseractError during orientation and script detection (OSD) when the image is supplied as a Pillow PIL.Image.

pytesseract.pytesseract.TesseractError: (1, 'UZN file /tmp/tess_k2k7ao6h loaded. Estimating resolution as 395 UZN file /tmp/tess_k2k7ao6h loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

reproducing

os: Debian GNU/Linux 10 (buster)
python version: 3.9.1
pytesseract version: 0.3.9
pillow version: 9.0.1

build tesseract versions

#!/usr/bin/env bash

# settings
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
BUILD_DIRECTORY="$DIR/build"
TESSERACT_VERSION=${1:-5.0.1}
TESSDATA_VERSION=${2:-4.1.0}

# init build dir
mkdir -p "$BUILD_DIRECTORY/tesseract"

# install requirements
sudo apt-get install -y automake ca-certificates g++ git libtool libleptonica-dev make pkg-config

# build tesseract
git clone git@github.com:tesseract-ocr/tesseract.git $DIR/tesseract
cd $DIR/tesseract
git checkout $TESSERACT_VERSION
./autogen.sh
./configure --prefix=$BUILD_DIRECTORY/tesseract/$TESSERACT_VERSION
make -j$(nproc)
make install
sudo ldconfig

# init share dir (-p so this does not fail if make install already created it)
mkdir -p "$BUILD_DIRECTORY/tesseract/$TESSERACT_VERSION/share"

# link tessdata
git clone git@github.com:tesseract-ocr/tessdata.git $DIR/tessdata
cd $DIR/tessdata
git checkout $TESSDATA_VERSION
git submodule init
git submodule update
ln -s $DIR/tessdata $BUILD_DIRECTORY/tesseract/$TESSERACT_VERSION/share/tessdata

# done
cd $DIR

attempt osd

Run the example script on the same image with each tesseract build. With tesseract 5.0.1, pytesseract.image_to_osd raises TesseractError when the image argument is a PIL.Image.
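The comparison across builds could be scripted roughly as follows. This is only a sketch: the helper names tesseract_binary and try_osd are hypothetical, and the binary location assumes the --prefix layout used by the build script above (so the executable ends up under bin/ inside each version directory).

```python
from pathlib import Path


def tesseract_binary(build_dir, version):
    """Return the binary path implied by the build script's --prefix layout."""
    return Path(build_dir) / "tesseract" / version / "bin" / "tesseract"


def try_osd(build_dir, version, image_path):
    """Run OSD with a file name and with a PIL.Image; return one outcome per form."""
    # Imported lazily so the path helper stays usable without these packages.
    import pytesseract
    from PIL import Image

    pytesseract.pytesseract.tesseract_cmd = str(tesseract_binary(build_dir, version))
    outcomes = {}
    with Image.open(image_path) as pil_image:
        for label, argument in (("file name", str(image_path)), ("PIL.Image", pil_image)):
            try:
                pytesseract.image_to_osd(argument)
                outcomes[label] = "ok"
            except pytesseract.pytesseract.TesseractError as error:
                # TesseractError carries the exit status and the stderr text.
                outcomes[label] = f"TesseractError ({error.status}): {error.message}"
    return outcomes
```

Calling try_osd("build", "5.0.1", "/path/to/image") once per built version gives a small table of which argument form fails on which build.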

example

import pytesseract
from PIL import Image

image_filename = "/path/to/image"
pytesseract.pytesseract.tesseract_cmd = "/path/to/tesseract"

pil_image = Image.open(image_filename)
result = pytesseract.image_to_osd(image=pil_image)
print(result)

Traceback (most recent call last):
  File ".../reproducing_tesseract_bug.py", line 16, in <module>
    result = pytesseract.image_to_osd(image=pil_image)
  File ".../.venv/lib/python3.9/site-packages/pytesseract/pytesseract.py", line 545, in image_to_osd
    return {
  File ".../.venv/lib/python3.9/site-packages/pytesseract/pytesseract.py", line 548, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File ".../.venv/lib/python3.9/site-packages/pytesseract/pytesseract.py", line 286, in run_and_get_output
    run_tesseract(**kwargs)
  File ".../.venv/lib/python3.9/site-packages/pytesseract/pytesseract.py", line 262, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'UZN file /tmp/tess_k2k7ao6h loaded. Estimating resolution as 395 UZN file /tmp/tess_k2k7ao6h loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

Process finished with exit code 1
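Since the error is reported for the PIL.Image form of the argument, one possible stopgap on an affected install is to write the image to a file yourself and hand pytesseract the path. The sketch below assumes the file-name form succeeds on the same build, which the report implies but which should be checked locally. The helper name osd_via_temp_file and its osd_func parameter are hypothetical and not part of pytesseract.

```python
import os
import tempfile


def osd_via_temp_file(image, osd_func=None):
    """Save `image` to a temporary PNG and run OSD on the file path instead."""
    if osd_func is None:
        # Imported lazily so the helper can be exercised without pytesseract.
        import pytesseract

        osd_func = pytesseract.image_to_osd

    handle = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    handle.close()  # close first so the file can be reopened on any platform
    try:
        image.save(handle.name, format="PNG")
        return osd_func(handle.name)
    finally:
        # Always clean up the temporary file, even if OSD raises.
        os.remove(handle.name)
```

With the example above, result = osd_via_temp_file(pil_image) would replace the direct image_to_osd call.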

analysis

Using image file name as argument to image_to_osd:

Using PIL.Image as argument to image_to_osd:

caerulescens commented 2 years ago

@int3l Could I get your first impression of what may be causing this issue?

bozhodimitrov commented 2 years ago

Hi @caerulescens, I think this is a duplicate of #408. Can you confirm that the master version doesn't have this issue for you?

caerulescens commented 2 years ago

@int3l Confirmed that the master branch does not have this issue. Thanks.
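To tell whether a given environment still runs the release named in this report (pytesseract 0.3.9), the installed version can be read from package metadata. This is a minimal sketch with hypothetical helper names. The thread only confirms that master is unaffected, so treating any release newer than 0.3.9 as fixed is an assumption.

```python
import re
from importlib import metadata

# Version reported as affected in this issue.
AFFECTED_VERSION = (0, 3, 9)


def version_tuple(version):
    """Turn a dotted release string such as '0.3.9' into a tuple of ints."""
    return tuple(int(number) for number in re.findall(r"\d+", version)[:3])


def is_newer_than_affected(version):
    """True if `version` sorts after the release named in the report."""
    return version_tuple(version) > AFFECTED_VERSION


def installed_pytesseract_version():
    """Read the installed pytesseract version from package metadata."""
    # Raises importlib.metadata.PackageNotFoundError if pytesseract is absent.
    return metadata.version("pytesseract")
```

For example, is_newer_than_affected(installed_pytesseract_version()) returns False on the 0.3.9 setup described above.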