Good morning,
I'm testing OCR with PyMuPDF (see the attached program) and get the error "No OCR support: TESSDATA_PREFIX not set".
For comparison, I also tested pytesseract with the same image (see the attached file).
As mentioned in the documentation, setting TESSDATA_PREFIX through os.environ inside the script has no effect.
Use the tessdata parameter to provide the folder name if you haven't set the variable outside (!) the script.
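A minimal sketch of that suggestion, assuming a PyMuPDF version whose Pixmap.pdfocr_tobytes() accepts the tessdata and language keyword arguments. The helper names ocr_kwargs and ocr_image_to_pdf_bytes are hypothetical and only serve the illustration; the folder check is an extra convenience, not something PyMuPDF requires.

```python
import os


def ocr_kwargs(tessdata_dir, language="eng"):
    """Build the keyword arguments for the OCR call (hypothetical helper).

    Fails early if the folder does not exist, instead of letting the
    OCR call fail later with a less specific message.
    """
    if not os.path.isdir(tessdata_dir):
        raise FileNotFoundError(f"tessdata folder not found: {tessdata_dir}")
    return {"language": language, "tessdata": tessdata_dir}


def ocr_image_to_pdf_bytes(image_path, tessdata_dir, language="eng"):
    """OCR one image and return a 1-page PDF with a text layer as bytes.

    Assumes PyMuPDF is installed and that pdfocr_tobytes() supports
    the tessdata parameter in the installed version.
    """
    import fitz  # imported here so ocr_kwargs works without PyMuPDF

    pix = fitz.Pixmap(image_path)
    # The folder is passed explicitly; os.environ is not modified.
    return pix.pdfocr_tobytes(**ocr_kwargs(tessdata_dir, language))
```

With the paths from the report, the call would look like ocr_image_to_pdf_bytes("/Users/yves/documents_1/Test_liasse_pytesseract/image.png", "/opt/local/share/tessdata/", language="fra"), and the returned bytes can be opened with fitz.open("pdf", ...) as in the original script.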
Description of the bug
Thanks for your feedback.
import fitz
import time, os
import pandas as pd

print(fitz.__doc__)
os.environ["TESSDATA_PREFIX"] = '/opt/local/share/tessdata/'
print()

fichier = "/Users/yves/documents_1/Test_liasse_pytesseract/image.png"

mat = fitz.Matrix(5, 5)  # high resolution matrix
ocr_time = 0
pix_time = 0
INVALID_UNICODE = chr(0xFFFD)

doc = fitz.open()

pix = fitz.Pixmap(fichier)
imgpdf = fitz.open("pdf", pix.pdfocr_tobytes())
doc.insert_pdf(imgpdf)
pix = None
imgpdf.close()
doc.save(path+"ocr-images.pdf")

for page in doc:
    text = page.get_text("text")
    print(text)
pytesseract:

from pytesseract import *
import os
import pandas as pd

os.environ["TESSDATA_PREFIX"] = '/opt/local/share/tessdata/'
pytesseract.tesseract_cmd = r"/opt/local/bin/tesseract"
custom_config = r' --psm 6 -c preserve_interword_spaces=1'
image = "/Users/yves/documents_1/Test_liasse_pytesseract/image.png"
d = pytesseract.image_to_data(image, config=custom_config, output_type='data.frame')
test1 = pytesseract.image_to_string(image, lang='fra', config=custom_config)
How to reproduce the bug
python3.9 -m pip install PyMuPDF
PyMuPDF version
1.23.x or earlier
Operating system
MacOS
Python version
3.9