pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

No OCR support: TESSDATA_PREFIX not set #3609

Closed Larbo53 closed 6 days ago

Larbo53 commented 6 days ago

Description of the bug

Good morning. I'm testing OCR (see the attached program) and get the error "No OCR support: TESSDATA_PREFIX not set". I also tested pytesseract with the same image (see the attached file).

Thanks for your feedback.

import fitz
import time, os
import pandas as pd

print(fitz.__doc__)
os.environ["TESSDATA_PREFIX"] = '/opt/local/share/tessdata/'
print()

fichier = "/Users/yves/documents_1/Test_liasse_pytesseract/image.png"

mat = fitz.Matrix(5, 5)  # high resolution matrix
ocr_time = 0
pix_time = 0
INVALID_UNICODE = chr(0xFFFD)

doc = fitz.open()

pix = fitz.Pixmap(fichier)
imgpdf = fitz.open("pdf", pix.pdfocr_tobytes())
doc.insert_pdf(imgpdf)
pix = None
imgpdf.close()
doc.save(path+"ocr-images.pdf")

for page in doc:
    text = page.get_text("text")
    print(text)

Pytesseract:

from pytesseract import *
import os
import pandas as pd

os.environ["TESSDATA_PREFIX"] = '/opt/local/share/tessdata/'
pytesseract.tesseract_cmd = r"/opt/local/bin/tesseract"
custom_config = r' --psm 6 -c preserve_interword_spaces=1'
image = "/Users/yves/documents_1/Test_liasse_pytesseract/image.png"
d = pytesseract.image_to_data(image, config=custom_config, output_type='data.frame')
test1 = pytesseract.image_to_string(image, lang='fra', config=custom_config)

How to reproduce the bug

python3.9 -m pip install PyMuPDF

PyMuPDF version

1.23.x or earlier

Operating system

MacOS

Python version

3.9

JorjMcKie commented 6 days ago

As mentioned in the documentation, manipulating os.environ inside the script does not work. Use the tessdata parameter to pass the folder name if you have not set TESSDATA_PREFIX outside (!) the script.
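
For illustration, here is a minimal sketch of what passing the folder through the tessdata parameter could look like. The helper names resolve_tessdata and ocr_image_to_text are hypothetical and not part of PyMuPDF. The sketch assumes Tesseract's language data really is in the folder you pass, and that Pixmap.pdfocr_tobytes accepts the language and tessdata keyword arguments, as it does in recent PyMuPDF releases.

```python
import os


def resolve_tessdata(explicit=None):
    """Hypothetical helper: pick the tessdata folder to hand to PyMuPDF.

    An explicitly passed folder wins. Otherwise fall back to a
    TESSDATA_PREFIX that was set outside the script (e.g. in the shell).
    """
    folder = explicit or os.environ.get("TESSDATA_PREFIX")
    if not folder:
        raise ValueError(
            "no tessdata folder given and TESSDATA_PREFIX is not set"
        )
    return folder


def ocr_image_to_text(image_path, tessdata=None, language="eng"):
    """Hypothetical helper: OCR one image and return the recognized text."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    pix = fitz.Pixmap(image_path)
    # Pass the folder explicitly instead of relying on os.environ.
    pdf_bytes = pix.pdfocr_tobytes(
        language=language,
        tessdata=resolve_tessdata(tessdata),
    )
    # The result is a one-page PDF with an OCR text layer.
    with fitz.open("pdf", pdf_bytes) as doc:
        return "\n".join(page.get_text("text") for page in doc)
```

With the paths from the report above, a call might look like ocr_image_to_text("/Users/yves/documents_1/Test_liasse_pytesseract/image.png", tessdata="/opt/local/share/tessdata/", language="fra"), assuming the French language data is installed in that folder.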