Status: Closed (eafaizal closed this 2 years ago)
This is because pytesseract does an internal conversion (pre-processing) and then sends the image data to tesseract by saving that data to a file. If you want to avoid that internal conversion, save the image data to a file yourself (if you don't already have the image file) and pass the file path to pytesseract as a string.
This is a very old, legacy way of passing data to tesseract itself. I think there is already a request to pass the image data via pipes to newer versions of tesseract, in order to avoid such conversions where possible.
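A minimal sketch of the "save it yourself, then pass the path" approach described above, assuming OpenCV (`cv2`) and pytesseract are installed. `temp_image_path` and `osd_from_array` are hypothetical helper names, not part of pytesseract. The thread does not confirm that this changes the OSD result for any particular image; if the original file is still on disk, passing its path directly is simpler.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_image_path(suffix=".png"):
    """Yield a file path inside a throwaway directory, removed on exit."""
    # A temporary directory is used instead of NamedTemporaryFile because an
    # open temporary file cannot be reopened by another process on Windows.
    with tempfile.TemporaryDirectory() as tmp_dir:
        yield os.path.join(tmp_dir, "osd_input" + suffix)


def osd_from_array(img, suffix=".png"):
    """Hypothetical helper: write an OpenCV image to disk, then run OSD on the path."""
    # Imported lazily so the path helper above works without these packages.
    import cv2
    import pytesseract

    with temp_image_path(suffix) as path:
        if not cv2.imwrite(path, img):
            raise OSError(f"could not write image to {path}")
        # Passing a str path means we control how the file is written,
        # instead of relying on pytesseract's own conversion.
        return pytesseract.image_to_osd(path)
```

With an array loaded by `cv2.imread`, the call would be `rot_data = osd_from_array(img)`.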
Thanks for the quick response
Tesseract Version:
tesseract v5.0.1.20220118
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0
PyTesseract Version 0.3.9
Windows 10 OS
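Code Snippet:

    import re

    import cv2
    import pytesseract

    img_path = r'C:\Users\test\Downloads\test_resize.jpg'
    # Load the image as a NumPy array (this is the input that triggers the error)
    img = cv2.imread(img_path)
    rot_data = pytesseract.image_to_osd(img)
    # Extract the digits that follow "Rotate: " in the OSD report
    rot = re.search(r'(?<=Rotate: )\d+', rot_data).group(0)
    angle = float(rot)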
Error:
----> 4 rot_data = pytesseract.image_to_osd(img);
5 #print("[OSD] "+rot_data)
6 rot = re.search('(?<=Rotate: )\d+', rot_data).group(0)
File c:\Users\test.conda\envs\invoice\lib\site-packages\pytesseract\pytesseract.py:545, in image_to_osd(image, lang, config, nice, output_type, timeout)
542 config = f'--psm 0 {config.strip()}'
543 args = [image, 'osd', lang, config, nice, timeout]
--> 545 return {
546 Output.BYTES: lambda: run_and_get_output(*(args + [True])),
547 Output.DICT: lambda: osd_to_dict(run_and_get_output(*args)),
548 Output.STRING: lambda: run_and_get_output(*args),
549 }[output_type]()
File c:\Users\test.conda\envs\invoice\lib\site-packages\pytesseract\pytesseract.py:548, in image_to_osd.<locals>.<lambda>()
542 config = f'--psm 0 {config.strip()}'
543 args = [image, 'osd', lang, config, nice, timeout]
545 return {
546 Output.BYTES: lambda: run_and_get_output(*(args + [True])),
547 Output.DICT: lambda: osd_to_dict(run_and_get_output(*args)),
--> 548 Output.STRING: lambda: run_and_get_output(*args),
549 }[output_type]()
(...)
283 'timeout': timeout,
284 }
--> 286 run_tesseract(**kwargs)
287 filename = kwargs['output_filename_base'] + extsep + extension
288 with open(filename, 'rb') as output_file:
output_filename_base, extension, lang, config, nice, timeout)
260 with timeout_manager(proc, timeout) as error_string:
261 if proc.returncode:
--> 262 raise TesseractError(proc.returncode, get_errors(error_string))
TesseractError: (1, 'UZN file C:\Users\test\AppData\Local\Temp\tess_3lsvqhn6 loaded. Estimating resolution as 152 UZN file C:\Users\test\AppData\Local\Temp\tess_3lsvqhn6 loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')
However, if the file path is passed as a string to the image_to_osd function, the correct result is returned.
Attached the image used for this test.
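For reference, the working call in this report is `pytesseract.image_to_osd(img_path)` with the path string. The rotation lookup in the snippet raises `AttributeError` when the report has no "Rotate:" line, so a slightly more defensive parse may be useful. The sketch below uses only the standard library; `parse_rotation` is a hypothetical helper name, and `SAMPLE_OSD` is a made-up string in the general shape of an OSD report, not output from this issue.

```python
import re

# Made-up sample for illustration only; not taken from the image in this report.
SAMPLE_OSD = (
    "Page number: 0\n"
    "Orientation in degrees: 90\n"
    "Rotate: 270\n"
    "Orientation confidence: 2.50\n"
    "Script: Latin\n"
    "Script confidence: 1.20\n"
)


def parse_rotation(osd_text):
    """Return the 'Rotate' angle as a float, or None if the field is missing."""
    # MULTILINE lets ^ match at the start of each line of the report.
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None


print(parse_rotation(SAMPLE_OSD))  # prints 270.0
```

Usage with the path from this report would be `angle = parse_rotation(pytesseract.image_to_osd(img_path))`, with a check for `None` before using the angle.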