madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0
5.84k stars 721 forks

IMAGE_TO_OSD fails with OPENCV or PIL Image but works fine if the image file path is passed as String #431

Closed eafaizal closed 2 years ago

eafaizal commented 2 years ago

Tesseract Version :

    tesseract v5.0.1.20220118
    leptonica-1.78.0
    libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
    Found AVX2
    Found AVX
    Found FMA
    Found SSE4.1
    Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
    Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

PyTesseract Version 0.3.9

Windows 10 OS

Code Snippet

    img_path = r'C:\Users\test\Downloads\test_resize.jpg'
    img = cv2.imread(img_path)
    rot_data = pytesseract.image_to_osd(img);
    rot = re.search('(?<=Rotate: )\d+', rot_data).group(0)
    angle = float(rot)

Error

    ----> 4 rot_data = pytesseract.image_to_osd(img);
          5 #print("[OSD] "+rot_data)
          6 rot = re.search('(?<=Rotate: )\d+', rot_data).group(0)

File c:\Users\test.conda\envs\invoice\lib\site-packages\pytesseract\pytesseract.py:545, in image_to_osd(image, lang, config, nice, output_type, timeout)
    542 config = f'--psm 0 {config.strip()}'
    543 args = [image, 'osd', lang, config, nice, timeout]
--> 545 return {
    546     Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    547     Output.DICT: lambda: osd_to_dict(run_and_get_output(*args)),
    548     Output.STRING: lambda: run_and_get_output(*args),
    549 }[output_type]()

File c:\Users\test.conda\envs\invoice\lib\site-packages\pytesseract\pytesseract.py:548, in image_to_osd.<locals>.<lambda>()
    542 config = f'--psm 0 {config.strip()}'
    543 args = [image, 'osd', lang, config, nice, timeout]
    545 return {
    546     Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    547     Output.DICT: lambda: osd_to_dict(run_and_get_output(*args)),
--> 548     Output.STRING: lambda: run_and_get_output(*args),
    549 }[output_type]()

    275 with save(image) as (temp_name, input_filename):
    276     kwargs = {
    277         'input_filename': input_filename,
    278         'output_filename_base': temp_name,
    (...)
    283         'timeout': timeout,
    284     }
--> 286     run_tesseract(**kwargs)
    287     filename = kwargs['output_filename_base'] + extsep + extension
    288     with open(filename, 'rb') as output_file:

output_filename_base, extension, lang, config, nice, timeout)
    260 with timeout_manager(proc, timeout) as error_string:
    261     if proc.returncode:
--> 262         raise TesseractError(proc.returncode, get_errors(error_string))

TesseractError: (1, 'UZN file C:\Users\test\AppData\Local\Temp\tess_3lsvqhn6 loaded. Estimating resolution as 152 UZN file C:\Users\test\AppData\Local\Temp\tess_3lsvqhn6 loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

But if the file path is passed as a string to the image_to_osd function, then the correct result is returned.


Attached the image used for this test.
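
A side note on the snippet above: `re.search` returns `None` when the OSD text has no `Rotate:` line, so chaining `.group(0)` directly would raise an `AttributeError` in that case. A minimal, more defensive sketch follows. `parse_rotation` is a hypothetical helper name, and the sample text is made up for illustration rather than taken from the attached image.

```python
import re

# Matches the "Rotate: <degrees>" line of tesseract's OSD text report.
_ROTATE_RE = re.compile(r"^Rotate:\s*(\d+)", re.MULTILINE)


def parse_rotation(osd_text):
    """Return the rotation in degrees from OSD text, or None if absent."""
    match = _ROTATE_RE.search(osd_text)
    return int(match.group(1)) if match else None


# Illustrative OSD-style text (an assumption, not output from this report).
sample = "Page number: 0\nOrientation in degrees: 270\nRotate: 90\nScript: Latin\n"
print(parse_rotation(sample))
print(parse_rotation("no orientation data"))
```

Returning `None` instead of raising lets the caller decide whether a missing rotation means "leave the image as it is" or "report an error".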

bozhodimitrov commented 2 years ago

This happens because pytesseract converts (pre-processes) an in-memory image internally and then sends the image data to tesseract by saving that data to a file. If you want to avoid the internal conversion, save the image data yourself (if you don't already have the image file) and then pass the file path to pytesseract as a string.
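
The workaround described above could look roughly like the sketch below. It is only an illustration, not pytesseract's own code: `image_to_osd_via_file` is a hypothetical helper name, the `.png` suffix is an assumed choice, and `img` is assumed to be an array of the kind returned by `cv2.imread`.

```python
import os
import tempfile


def image_to_osd_via_file(img, suffix=".png"):
    """Save an OpenCV image to a temporary file and run OSD on that path.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    # Imported here so this module can be loaded without either package.
    import cv2
    import pytesseract

    # mkstemp returns an open descriptor; close it, only the path is needed.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        if not cv2.imwrite(path, img):
            raise OSError(f"could not write image to {path}")
        # A str argument is treated as a file path and handed to tesseract
        # directly, so the internal image conversion is not applied.
        return pytesseract.image_to_osd(path)
    finally:
        # Remove the temporary file whether or not OSD succeeded.
        os.remove(path)
```

If the image already exists on disk, as in the report above, passing `img_path` to `image_to_osd` directly is simpler and avoids the extra write.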

This behavior is an old, legacy way of passing data to tesseract. I think there is already a request to pass the image data to newer versions of tesseract via pipes, in order to avoid such conversions where possible.

eafaizal commented 2 years ago

Thanks for the quick response