ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Improve user experience for Windows 10 #455

Closed dibu28 closed 2 years ago

dibu28 commented 4 years ago

Hi

Describe the issue: I've managed to run OCRmyPDF.exe on Windows 10 without WSL.

To Reproduce: I've made a fork and added some quick fixes in this commit: https://github.com/dibu28/OCRmyPDF/commit/543088e79e8649e968d02d8fd268123255607dc1

Fixes are:

1) In leptonica.py, the library name is liblept-5 instead of lept.
2) In ghostscript.py:
   2.1) The executable name is gswin64c.exe instead of gs.
   2.2) NamedTemporaryFile doesn't work properly: gs could not modify the temp file and failed with an access denied error. As a temporary workaround I'm adding "_1" to the temp file name and then removing the file. There could be a better way.
3) In _pipeline.py and helpers.py, symlinking to the temp folder on Windows requires Admin privileges, so instead of symlinking I'm just copying the files.
4) In _sync.py, os.path.samefile returns an error: "OSError: [WinError 1] Incorrect function: 'nul'"
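For fixes 2.2 and 3, a minimal sketch of what a less ad hoc workaround might look like. This is not the code from the commit above; the helper names `link_or_copy` and `make_closed_temp_file` are hypothetical. It assumes the access denied error comes from the temp file still being held open by Python when gs tries to write to it, which is a common cause on Windows but is not confirmed in this thread.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy(src, dst):
    """Try to symlink dst -> src; fall back to a plain copy if that fails.

    Hypothetical helper: on Windows, os.symlink can fail without
    elevated privileges, so copying is used as the fallback.
    """
    try:
        os.symlink(src, dst)
    except (OSError, NotImplementedError):
        shutil.copy2(src, dst)


def make_closed_temp_file(suffix=""):
    """Create a temp file, close it right away, and return its path.

    Hypothetical helper: with delete=False the file survives close(),
    so an external program can open it. The caller must delete it.
    """
    handle = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    handle.close()
    return Path(handle.name)
```

The caller would pass the returned path to the external tool and call `unlink()` on it afterwards. Absolute paths are safest for `link_or_copy`, since a relative symlink target is resolved relative to the link's directory.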

So after those changes and installing the dependencies, it started to work from the command line like this: OCRmyPDF.exe input.pdf output.pdf

Dependencies and binaries I'm using:
https://www.python.org/ftp/python/3.7.5/python-3.7.5-amd64.exe
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe
https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs950/gs950w64.exe
https://github.com/qpdf/qpdf/releases/download/release-qpdf-9.0.2/qpdf-9.0.2-bin-msvc64.zip

Add paths to PATH variable:
set PATH=%PATH%;C:\Program Files\Tesseract-OCR;
set PATH=%PATH%;C:\Program Files\gs\gs9.50\bin\;
set PATH=%PATH%;C:\qpdf\qpdf-9.0.2-bin-msvc64\qpdf-9.0.2\bin\;
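To confirm that the PATH changes took effect before building, a small check along these lines may help. It is only a sketch: `missing_tools` is a hypothetical helper, and the tool names `tesseract` and `qpdf` are assumed from the install directories above (`gswin64c` is the Ghostscript executable name mentioned in the fixes).

```python
import shutil


def missing_tools(names):
    """Return the subset of executable names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Executable names are assumptions based on the tools listed above.
    required = ["tesseract", "gswin64c", "qpdf"]
    missing = missing_tools(required)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```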

python setup.py build
OCRmyPDF.exe input.pdf output.pdf

Expected behavior: Can we add some workarounds using conditions based on OS type?
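As an illustration of what conditions based on OS type could look like, here is a minimal sketch. The function names are hypothetical and this is not how OCRmyPDF itself resolves these names; the values are the ones listed in the fixes above.

```python
import os


def ghostscript_executable(os_name=os.name):
    """Pick the Ghostscript executable name for the given os.name value."""
    # "nt" is the value of os.name on Windows.
    return "gswin64c" if os_name == "nt" else "gs"


def leptonica_library_name(os_name=os.name):
    """Pick the leptonica library name for the given os.name value."""
    return "liblept-5" if os_name == "nt" else "lept"
```

Taking `os_name` as a parameter with a default keeps the functions easy to exercise for both platforms on a single machine.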


jbarlow83 commented 4 years ago

@osnofas You'll have to be careful to ensure that you install ocrmypdf into the same native Windows Python distribution that you want to run it with. The directory listing of "Windows proper" shows ocrmypdf was installed to a different distribution.

You may need to create a virtual environment and install ocrmypdf to there.
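One way to check which distribution is in play is to run a short script with the same `python` that is expected to run ocrmypdf. This is a sketch only; `describe_environment` is a hypothetical helper, not part of ocrmypdf.

```python
import importlib.util
import sys


def describe_environment(package="ocrmypdf"):
    """Report the running interpreter and where it would import `package` from."""
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,
        # In a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # None means this interpreter cannot see the package at all.
        "package_location": spec.origin if spec else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `package_location` is `None`, or points into a different Python installation than `interpreter`, the package was probably installed for another distribution.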

ajweber commented 4 years ago

Trying to get this working. unpaper is not found, and choco does not appear to have a package for it.

Does anyone have tips/tricks for installing a 64-bit Windows binary of unpaper to support the --clean option?

FWIW: I found a pre-built 6.2 binary, but that didn't work. It runs (and shows me a version), but ocrmypdf dumps a stack-trace when it tries to use it. (I put it in my PATH)

joseavegaa commented 4 years ago

Hey! I know this is an old thread but I just installed Python 3.8, Tesseract 5.0.0, Ghostscript, pngquant, and ocrmypdf. When I execute ocrmypdf --help, I get the following:

  File "c:\users\javegaa\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\javegaa\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\javegaa\AppData\Local\Programs\Python\Python38-32\Scripts\ocrmypdf.exe\__main__.py", line 5, in <module>
  File "c:\users\javegaa\appdata\local\programs\python\python38-32\lib\site-packages\ocrmypdf\__init__.py", line 18, in <module>
    from . import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "c:\users\javegaa\appdata\local\programs\python\python38-32\lib\site-packages\ocrmypdf\leptonica.py", line 70, in <module>
    lept = ffi.dlopen(_libpath)
OSError: cannot load library 'C:\Program Files\Tesseract-OCR\liblept-5.dll': error 0xc1

Thanks for the help, really appreciate this.

jbarlow83 commented 4 years ago

You installed 32-bit Python and 64-bit Tesseract, and these can't interface.

Use 64-bit Python instead.
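A quick way to check the bitness of the running Python is sketched below (`python_bits` is a hypothetical helper name). The `python38-32` directory in the traceback already hints at a 32-bit install, and error 0xc1 is decimal 193, which to my knowledge is the Windows code for a bad executable format, consistent with a 32-bit/64-bit mismatch.

```python
import struct
import sys


def python_bits():
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print(f"{sys.executable} is a {python_bits()}-bit Python")
```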


GouravDataAnalyst commented 4 years ago

import ocrmypdf
Traceback (most recent call last):
  File "", line 1, in <module>
    import ocrmypdf
  File "C:\Users\22252\AppData\Roaming\Python\Python38\site-packages\ocrmypdf\__init__.py", line 10, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "C:\Users\22252\AppData\Roaming\Python\Python38\site-packages\ocrmypdf\leptonica.py", line 62, in <module>
    lept = ffi.dlopen(_libpath)
OSError: cannot load library 'D:\OCR\Tesseract-OCR\liblept-5.dll': error 0x7f

Please let me know how to fix this.

StudioEtrange commented 3 years ago

When installing tesseract from the conda package, leptonica is installed, but on Windows the library is named leptonica-x.x.x.dll, which is not the name that leptonica.py looks for.

Maybe instead of listing every way the leptonica library name could be written, there is another way to detect it? I do not know.

As it stands, I cannot use ocrmypdf on Windows with a conda or mamba env.
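One possible alternative to a hard-coded list of names is to search candidate directories with wildcard patterns. The sketch below is an illustration only, not how leptonica.py works; `find_leptonica_dlls` is a hypothetical helper and the two patterns are assumptions based on the file names mentioned in this thread.

```python
import os
from pathlib import Path


def find_leptonica_dlls(search_dirs, patterns=("liblept*.dll", "leptonica*.dll")):
    """Return DLL paths in search_dirs whose names match any of the patterns."""
    found = []
    for directory in search_dirs:
        base = Path(directory)
        if not base.is_dir():
            continue
        for pattern in patterns:
            found.extend(sorted(base.glob(pattern)))
    return found


if __name__ == "__main__":
    # Search every non-empty entry on PATH.
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    for dll in find_leptonica_dlls(dirs):
        print(dll)
```

Each returned path could then be tried in turn with the loader until one succeeds.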

https://anaconda.org/conda-forge/leptonica https://github.com/conda-forge/leptonica-feedstock

jbarlow83 commented 2 years ago

leptonica is no longer a dependency - this should resolve the remaining Windows issues.