ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0
13.69k stars 998 forks source link

Improve user experience for Windows 10 #455

Closed dibu28 closed 2 years ago

dibu28 commented 4 years ago

Hi

Describe the issue I've managed to run OCRmyPDF.exe on Windows 10 without wsl.

To Reproduce I've made a fork and added some quick fixes in this commit: https://github.com/dibu28/OCRmyPDF/commit/543088e79e8649e968d02d8fd268123255607dc1

Fixes are:

1) In leptonica.py, the library name is liblept-5 instead of lept.

2) In ghostscript.py:

2.1) The executable name is gswin64c.exe instead of gs.

2.2) NamedTemporaryFile doesn't work properly, and gs could not modify the temp file (access denied error). As a temporary workaround I'm adding "_1" to the temp file name and then removing the file. There could be a better way.

3) In _pipeline.py and helpers.py, symlinking to the temp folder on Windows requires Admin privileges, so instead of symlinking I'm just copying files.

4) In _sync.py, os.path.samefile returns an error: "OSError: [WinError 1] Incorrect function: 'nul'"
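
Fixes 1 and 2.1 come down to picking a different name per platform. A minimal sketch of keeping that choice in one place, assuming a hypothetical helper called platform_names() (it is not part of OCRmyPDF); the names themselves are the ones listed above:

```python
import os


def platform_names(os_name=os.name):
    """Return the Leptonica library name and Ghostscript executable name.

    Hypothetical helper for illustration only. os_name defaults to
    os.name, which is 'nt' on native Windows.
    """
    if os_name == 'nt':
        return {'leptonica': 'liblept-5', 'ghostscript': 'gswin64c.exe'}
    return {'leptonica': 'lept', 'ghostscript': 'gs'}


# Shows the names that would be used on the current platform.
print(platform_names())
```

Passing os_name explicitly makes it possible to exercise the Windows branch from a non-Windows machine.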

So after those changes and installing the dependencies, it started to work from the command line like this: OCRmyPDF.exe input.pdf output.pdf

Dependencies and binaries I'm using:
https://www.python.org/ftp/python/3.7.5/python-3.7.5-amd64.exe
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe
https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs950/gs950w64.exe
https://github.com/qpdf/qpdf/releases/download/release-qpdf-9.0.2/qpdf-9.0.2-bin-msvc64.zip

Add paths to PATH variable:
set PATH=%PATH%;C:\Program Files\Tesseract-OCR;
set PATH=%PATH%;C:\Program Files\gs\gs9.50\bin\;
set PATH=%PATH%;C:\qpdf\qpdf-9.0.2-bin-msvc64\qpdf-9.0.2\bin\;

python setup.py build
OCRmyPDF.exe input.pdf output.pdf

Expected behavior Can we add some workarounds using conditions based on os type?
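
For the symlink fix described above, one possible shape for such a workaround is sketched here. It does not test the OS type directly; it tries the symlink first and falls back to copying if that fails, which is a different technique from an os-type condition. link_or_copy is a hypothetical name, not a function from OCRmyPDF:

```python
import os
import shutil


def link_or_copy(src, dst):
    """Symlink dst to src; fall back to copying if the symlink cannot be made.

    Hypothetical helper for illustration. os.symlink raises OSError on
    Windows when the process is not allowed to create symbolic links.
    """
    src = os.path.abspath(src)  # keep the link valid regardless of dst's folder
    try:
        os.symlink(src, dst)
    except (OSError, NotImplementedError):
        shutil.copyfile(src, dst)
```

The trade-off is that a copy costs disk space and time for large inputs, while a symlink does not.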

System:

Additional context

jbarlow83 commented 4 years ago

Wow, nice... definitely interested in getting this merged.

I'd prefer to encapsulate changes and have a single source of truth.

For the named temporary file issue we can avoid Ghostscript temporary files entirely. I'll push a commit for you that improves that.
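
For background, the Python documentation notes that a file created by NamedTemporaryFile cannot be opened a second time by name on Windows while it is still open. The sketch below is not what the commit mentioned here does (that removes the temporary files entirely); it only illustrates a common alternative, where the scratch file lives inside a TemporaryDirectory and is never held open by the calling process. run_with_scratch_file and make_command are hypothetical names, and a child Python process stands in for Ghostscript:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_scratch_file(make_command):
    """Run an external command that writes a scratch file; return its bytes.

    make_command is a hypothetical callback that receives the scratch path
    and returns the argv list to execute.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        scratch = Path(tmpdir) / 'output.tmp'  # never opened by this process
        subprocess.run(make_command(scratch), check=True)
        return scratch.read_bytes()


# Stand-in for Ghostscript: a child Python process that writes the file.
data = run_with_scratch_file(
    lambda path: [
        sys.executable, '-c',
        'import sys; open(sys.argv[1], "wb").write(b"ok")',
        str(path),
    ]
)
print(data)
```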

That sort of thing.

I'd like to get to 100% tests passing on Windows. (Of course we can skip platform specific tests.)

jbarlow83 commented 4 years ago

See the branch gs-temp-files for a commit that removes NamedTemporaryFile from ghostscript.py

jbarlow83 commented 4 years ago

I implemented the rest of the changes you suggested in a compatible way in the windows branch.

By any chance, do you know how to automate the installation of those packages (headless) for continuous integration?

dibu28 commented 4 years ago

It depends on which CI you use.

In simple words:

1) python-3.7.5-amd64.exe, tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe and gs950w64.exe are Windows installers. They should have a command line option for "silent mode", but the option can differ depending on the type of installer they use. (python-3.7.5-amd64.exe is just Python itself. If it is already installed, there is no need to install it.)

2) qpdf-9.0.2-bin-msvc64.zip is just a folder that you should unzip and place somewhere, or if there is a Python package for it, just install that as a dependency.

3) Add paths to the PATH variable, so that the OCRmyPDF script can find all those executables.
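
After a headless install, a CI job could confirm that the executables are actually reachable before running anything else. A minimal sketch using the standard library; missing_tools is a hypothetical helper, and the executable names are assumptions based on the installers listed in this thread:

```python
import shutil


def missing_tools(names):
    """Return the names that shutil.which cannot find on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Executable names are assumptions based on the installers listed above.
print(missing_tools(['tesseract', 'gswin64c', 'qpdf']))
```

An empty list means every tool was found; anything else points at a PATH entry that is missing or misspelled.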

dibu28 commented 4 years ago

Also. I've tried windows branch on my system and it is working.

jbarlow83 commented 4 years ago

The test suite is pretty far from passing unfortunately.

jbarlow83 commented 4 years ago

To elaborate, I was able to replicate what you set up and fixed a few things.

Some notes, more for myself:

bobastler commented 4 years ago

Really nice, but that leads me to the question: where can I get the Windows exe file?

jbarlow83 commented 4 years ago

This is still in development (mainly limited by my available time to work on it) and it does not pass the test suite on Windows so user beware.

I believe if you do python setup.py build on a source directory Windows will build an exe. (@dibu28 do you know for sure?)

bobastler commented 4 years ago

I patched the files as described and no, it doesn't build an exe.

jbarlow83 commented 4 years ago

python setup.py bdist --format=msi should make a windows .msi installer

jbarlow83 commented 4 years ago

Use the windows branch on this repo for the latest change set

bobastler commented 4 years ago

Yes, something was built, but when i run it:

Traceback (most recent call last):
  File "D:\xyz\OCRmyPDF\build\bdist.win-amd64\msi\Scripts\ocrmypdf-script.py", line 6, in <module>
    from pkg_resources import load_entry_point
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 3250, in <module>
    @_call_aside
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 3234, in _call_aside
    f(*args, **kwargs)
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 3263, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 583, in _build_master
    ws.require(__requires__)
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 786, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'ocrmypdf==9.1.0.post12+g3569da3' distribution was not found and is required by the application

jbarlow83 commented 4 years ago

Since it's not stable in the test suite yet I haven't even started to think about how to distribute it, but off the top of my head, try bypassing version management: Try removing setuptools_scm* from setup.py, manually setting the package version to something like 9.2.0a1 in setup.py, and rebuilding. Possibly reinstall into a virtual environment. This is a hackish workaround.

bobastler commented 4 years ago

D:\Python37\Scripts>ocrmypdf
Traceback (most recent call last):
  File "D:\Python37\Scripts\ocrmypdf-script.py", line 6, in <module>
    from pkg_resources import load_entry_point
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 3250, in <module>
    @_call_aside
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 3234, in _call_aside
    f(*args, **kwargs)
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 3263, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 583, in _build_master
    ws.require(__requires__)
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 786, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'tqdm>=4' distribution was not found and is required by ocrmypdf

If I delete the install_requires entry for tqdm in setup.py, and even if I remove all of them, the next error comes up:

    raise DistributionNotFound(req, requirers) 
pkg_resources.DistributionNotFound: The 'Pillow>=6.2.0' distribution was not found and is required by ocrmypdf
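
The errors above surface one missing distribution at a time. A small sketch for listing the absent ones in a single pass is given here. It uses importlib.metadata, which is in the standard library from Python 3.8 onward; the Python 3.7 used in this thread would need the importlib_metadata backport instead. missing_distributions is a hypothetical helper, and the two names checked are the ones from the tracebacks:

```python
from importlib import metadata  # standard library in Python 3.8+


def missing_distributions(names):
    """Return the distribution names that are not installed for this interpreter."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Names taken from the DistributionNotFound errors above.
print(missing_distributions(['tqdm', 'Pillow']))
```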
jbarlow83 commented 4 years ago

The Windows Subsystem for Linux version works quite well.

bobastler commented 4 years ago

That's not an option in my environment; I use Windows 8.1 at the moment. Later usage would be in other environments, and WSL is not an option there either. It should run natively under Windows.

I'm programming a tool to create searchable PDFs with PowerShell, Tesseract and some other tools under Windows, and that's when I found your OCRmyPDF. So I thought: why reinvent the wheel ...

bobastler commented 4 years ago

After installing everything in another environment, including all the required Python packages, it stops with:


D:\Python37\Scripts>ocrmypdf.exe
Traceback (most recent call last):
  File "D:\Python37\Scripts\ocrmypdf-script.py", line 11, in <module>
    load_entry_point('ocrmypdf==0.0.0', 'console_scripts', 'ocrmypdf')()
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 2852, in load_entry_point
    return ep.load()
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 2443, in load
    return self.resolve()
  File "D:\Python37\lib\site-packages\pkg_resources\__init__.py", line 2449, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "D:\Python37\lib\site-packages\ocrmypdf\__init__.py", line 18, in <module>
    from . import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "D:\Python37\lib\site-packages\ocrmypdf\leptonica.py", line 46, in <module>
    lept = ffi.dlopen(find_library(libname))
OSError: cannot load library '<None>': error 0x57

which leads to https://github.com/jbarlow83/OCRmyPDF/issues/341

jbarlow83 commented 4 years ago

Is Leptonica installed?

Could also try copying liblept*.dll into D:\Python37\lib\site-packages\ocrmypdf, or the current directory.... I imagine Tesseract installs Leptonica.
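
The '<None>' in the OSError above is consistent with ctypes.util.find_library returning None when it cannot locate the library, and that None then being passed straight to dlopen. A minimal sketch of failing earlier with a clearer message; require_library is a hypothetical helper, not OCRmyPDF's actual code:

```python
from ctypes.util import find_library


def require_library(libname):
    """Return the located library name/path, or raise a descriptive ImportError."""
    path = find_library(libname)
    if path is None:
        raise ImportError(
            f"Could not find shared library {libname!r}; "
            "check that the folder containing it is on PATH"
        )
    return path


# 'liblept-5' is the Windows library name mentioned earlier in this thread.
try:
    print(require_library('liblept-5'))
except ImportError as exc:
    print(exc)
```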

bobastler commented 4 years ago

I had to set up the system PATH, and after a restart it runs. Now I'm testing.

Thanks

jbarlow83 commented 4 years ago

Good to hear. If you have any fixes please feel free to contribute.

bobastler commented 4 years ago

I have to test and build again and again to make it reproducible. Now it runs in a current Windows 10 environment but not under Windows 8.1:

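[image: ocrmypdf-error01] https://user-images.githubusercontent.com/54740896/69487883-fc8df600-0e61-11ea-8fc8-a8a1cd8d6847.jpg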

jbarlow83 commented 4 years ago

That would be a problem for the packager of Tesseract for Windows to address.

If you run in debug mode with -k -v1 or -v2, you should be able to find the exact Tesseract command that fails and provide them with the .png from the temporary files folder.

You might be able to work around the error by manually compiling/installing libpng 1.6 and copying a DLL into place.

bobastler commented 4 years ago

It's the same Tesseract installer. It runs under Windows 10 but not under Windows 8.1. It's late, 2:37 am; I'll keep searching tomorrow.

bobastler commented 4 years ago

One question: is there an option to remove blank pages?

jbarlow83 commented 4 years ago

No, I've long resisted adding that feature because of the risk of false positives. I've never found any scanner software that does it reliably without requiring an arbitrary threshold, and as the user you really have no idea how to set that except turn it up/down if it's giving you trouble. They can also be quite different in behavior on color vs grayscale vs black and white. You can get problems like poorly exposed color/gray getting rounded off to white. So in my opinion the state of the art for that feature is pretty poor. But if you know of something that addresses the problems I'll look.

ocrmypdf is designed to be as safe as possible so you can throw millions of pages at it and be confident it didn't lose any data.
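
To make the threshold problem described above concrete, here is a deliberately naive blank-page check. It is purely illustrative; nothing like it exists in ocrmypdf, and looks_blank, dark_cutoff and max_dark_fraction are all hypothetical names and values. Pixels are plain grayscale integers from 0 (black) to 255 (white):

```python
def looks_blank(pixels, dark_cutoff=128, max_dark_fraction=0.001):
    """Naive check: a page is 'blank' if almost no pixel is darker than the cutoff."""
    pixels = list(pixels)
    if not pixels:
        return True
    dark = sum(1 for p in pixels if p < dark_cutoff)
    return dark / len(pixels) <= max_dark_fraction


# 2% of the page is a faint pencil note at gray level 150; the rest is white.
faint_page = [150] * 20 + [255] * 980
# The same amount of content, but printed in dark ink at gray level 30.
dark_page = [30] * 20 + [255] * 980

print(looks_blank(faint_page))  # True: the faint note is treated as blank
print(looks_blank(dark_page))   # False
```

The faint page carries exactly as much content as the dark one, yet the arbitrary cutoff discards it, which is the kind of false positive at issue.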

dibu28 commented 4 years ago

1) @jbarlow83 After I execute python setup.py build the exe file is not available in dist folder. If I execute python setup.py install the exe file will be in Programs\Python\Python37\Scripts\ocrmypdf.exe and available in the PATH. It will use the code from: Programs\Python\Python37\Lib\site-packages\ocrmypdf-9.1.1-py3.7.egg\ocrmypdf

2) I also was able to build MSI installer using python setup.py bdist --format=msi as @jbarlow83 suggested.

3) If you get OSError: cannot load library '<None>': error 0x57 error you need to add tesseract folder to the PATH variable.

4) There is libjbig-2.dll in the tesseract installation. I don't know if you can use it.

dibu28 commented 4 years ago

It seems that choco install tesseract --pre is installing tesseract from the same source i've mentioned in the first post: https://github.com/UB-Mannheim/tesseract/wiki https://chocolatey.org/packages/tesseract#files

dibu28 commented 4 years ago

@jbarlow83 can you please tell me how to run tests?

jbarlow83 commented 4 years ago

In an ocrmypdf project folder:

pip install -r requirements/test.txt
pytest -n auto

dibu28 commented 4 years ago

For pytest -n auto I've got this result: ========== 15 failed, 183 passed, 42 skipped, 2 xfailed in 179.37s (0:02:59) ===========

on branch windows

dibu28 commented 4 years ago

Two more tests passing: 13 failed, 185 passed, 42 skipped, 2 xfailed in 158.49s (0:02:38)

jbarlow83 commented 4 years ago

We're now at 100% tests passing. Took longer than I thought it would, and I was expecting it to take a long time.

dibu28 commented 4 years ago

Wow, nice. I've pulled the windows branch and tried to run the tests, but I got this result: == 3 failed, 90 passed, 40 skipped, 1 xfailed, 105 errors in 34.32s== Is it OK or am I missing something? The script itself is working.

jbarlow83 commented 4 years ago

I've been rebasing the changes to organize it more logically. So make sure you hard reset and force pull the windows branch. I also hadn't pushed a change or two when you commented. If anything is still broken after this, please attach the logs so I can look.

dibu28 commented 4 years ago

@jbarlow83 I've downloaded QPDF 9.1.0 and set the PATH to point to it, and now the errors are gone: 22 failed, 173 passed, 40 skipped, 2 xfailed. But now I have failed tests with this message:

E         The program 'qpdf' could not be executed or was not found on your
E         system PATH.

But if I execute qpdf from the command line, it is available on the PATH:

qpdf.exe --version
qpdf version 9.0.2

I will attach logs later

jbarlow83 commented 4 years ago

@dibu28 QPDF is no longer required, provided that pikepdf binary wheels are used. Please try the v9.2.0 release.

dibu28 commented 4 years ago

@jbarlow83 I've reinstalled all my dependencies, including Python, and pulled the latest v9.2.0. Now it seems the tests are passing. Only one failed: =1 failed, 200 passed, 39 skipped, 2 xfailed in 118.38s (0:01:58)= Is that correct?

As for qpdf: if I only install OCRmyPDF with python setup.py install and try to run OCRmyPDF, I get this error:

AppData\Local\Programs\Python\Python37\lib\site-packages\pikepdf-1.8.1-py3.7-win-amd64.egg\pikepdf\__init__.py", line 10, in <module>
from . import _qpdf
ImportError: DLL load failed: The specified module could not be found.

It seems that pikepdf doesn't have all the required DLL libraries in its folder pikepdf-1.8.1-py3.7-win-amd64.egg\pikepdf; there is only the qpdf26.dll file. So I've downloaded the full qpdf package and put qpdf-9.1.0\bin in the PATH. It contains these files:

api-ms-win-crt-convert-l1-1-0.dll
api-ms-win-crt-environment-l1-1-0.dll
api-ms-win-crt-filesystem-l1-1-0.dll
api-ms-win-crt-heap-l1-1-0.dll
api-ms-win-crt-locale-l1-1-0.dll
api-ms-win-crt-math-l1-1-0.dll
api-ms-win-crt-runtime-l1-1-0.dll
api-ms-win-crt-stdio-l1-1-0.dll
api-ms-win-crt-string-l1-1-0.dll
api-ms-win-crt-time-l1-1-0.dll
api-ms-win-crt-utility-l1-1-0.dll
msvcp140.dll
qpdf.exe
qpdf26.dll
vcruntime140.dll
vcruntime140_1.dll
zlib-flate.exe

Now the only test that fails is:

__________________________________ test_bash __________________________________
[gw2] win32 -- Python 3.7.5 c:\users\d\appdata\local\programs\python\python37\python.exe
    def test_bash():
        try:
            proc = run(
                ['bash', '-n', 'misc/completion/ocrmypdf.bash'],
                check=True,
                encoding='utf-8',
                stdout=PIPE,
>               stderr=PIPE,
            )
tests\test_completion.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E               subprocess.CalledProcessError: Command '['bash', '-n', 'misc/completion/ocrmypdf.bash']' returned non-zero exit status 4294967295.
c:\users\d\appdata\local\programs\python\python37\lib\subprocess.py:512: CalledProcessError

jbarlow83 commented 4 years ago

Both of these changes have to do with me having more development tools than typical on my Windows box.

I take it that the VC14 runtime is a requirement or something along those lines.

Not too concerned about the bash test failing. This doesn't matter for native Windows.
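
On the VC14 runtime point, "DLL load failed: The specified module could not be found" on import often means that a DLL the extension depends on is missing, not the extension itself. A rough sketch for checking whether the runtime DLLs can be loaded is shown here. missing_runtime_dlls is a hypothetical helper; the DLL names come from the qpdf bin folder listing earlier in this thread, and whether they are exactly what pikepdf needs is an assumption:

```python
import ctypes
import os

# DLL names taken from the qpdf bin folder listing earlier in this thread.
RUNTIME_DLLS = ['msvcp140.dll', 'vcruntime140.dll', 'vcruntime140_1.dll']


def missing_runtime_dlls(names=RUNTIME_DLLS):
    """Return the DLLs that Windows cannot load; always [] on other systems."""
    if os.name != 'nt':
        return []
    missing = []
    for name in names:
        try:
            ctypes.WinDLL(name)  # raises OSError if the DLL cannot be loaded
        except OSError:
            missing.append(name)
    return missing


print(missing_runtime_dlls())
```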

dibu28 commented 4 years ago

@jbarlow83 Executed tests again on the clean setup. Now it seems that all tests are passing: = 199 passed, 40 skipped, 3 xfailed in 299.36s (0:04:59) =

nQk2 commented 4 years ago

Unfortunately it's not working on my Windows 10 machine. First I get a message box saying

The procedure entry point inflateReset2 could not be located in the dynamic link library C:\Program Files\Tesseract-OCR\libpng16-16.dll.

Then the console says

Traceback (most recent call last):
  File "c:\users\david\appdata\local\programs\python\python38\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\david\appdata\local\programs\python\python38\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\David\AppData\Local\Programs\Python\Python38\Scripts\ocrmypdf.exe\__main__.py", line 4, in <module>
  File "c:\users\david\appdata\local\programs\python\python38\lib\site-packages\ocrmypdf\__init__.py", line 18, in <module>
    from . import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "c:\users\david\appdata\local\programs\python\python38\lib\site-packages\ocrmypdf\leptonica.py", line 61, in <module>
    lept = ffi.dlopen(_libpath)
OSError: cannot load library 'C:\Program Files\Tesseract-OCR\liblept-5.dll': error 0x7f

Could this be 32 vs. 64 bit related? I first had Python 32-bit installed. Then I got ...error 0xc1

Then I uninstalled Python and installed Python 64-bit, but now I'm getting ...error 0x7f

dibu28 commented 4 years ago

@nQk2 try Python 3.7.5. I've also had problems with 3.8. And make sure you've added Tesseract-OCR, gs9.50\bin, and qpdf-9.1.0\bin to your PATH variable, and ;.PY to the PATHEXT variable.

jbarlow83 commented 4 years ago

@nQk2 All of the components must be the same bitness and really should be 64-bit. It's not hard for a program that aggressively uses all available CPU power to run into the 2GB memory wall on 32-bit Windows.

It definitely won't work to interface 32-bit Python to a 64-bit library which is probably the cause of that stacktrace.
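
A bitness mismatch can be checked directly. Windows error 0xc1 (193, ERROR_BAD_EXE_FORMAT) is the usual symptom of loading a library built for a different architecture. The sketch below reports the interpreter's bitness and reads the machine field from a DLL's PE header; python_bitness and dll_bitness are hypothetical helper names, and only the two machine types relevant here are recognized:

```python
import struct


def python_bitness():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize('P') * 8


def dll_bitness(path):
    """Read the PE header of a Windows DLL/EXE and return 32, 64, or None."""
    with open(path, 'rb') as f:
        header = f.read(64)
        if len(header) < 64 or header[:2] != b'MZ':
            return None
        # Offset 0x3C of the DOS header holds the file offset of the PE header.
        (pe_offset,) = struct.unpack_from('<I', header, 0x3C)
        f.seek(pe_offset)
        pe_start = f.read(6)
    if len(pe_start) < 6 or pe_start[:4] != b'PE\0\0':
        return None
    # The COFF header begins with a 2-byte machine type.
    (machine,) = struct.unpack_from('<H', pe_start, 4)
    return {0x014C: 32, 0x8664: 64}.get(machine)


print(python_bitness())
# Example call on Windows, using the path from the traceback above:
# dll_bitness(r'C:\Program Files\Tesseract-OCR\liblept-5.dll')
```

If the two numbers differ, the library cannot be loaded into that interpreter.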

jbarlow83 commented 4 years ago

@dibu28 It should not be necessary to put qpdf-...\bin in the PATH anymore. Eventually the other two won't be needed either.

dibu28 commented 4 years ago

@jbarlow83 You will pack tesseract and gs as python packages? (Windows versions)

jbarlow83 commented 4 years ago

The first step will be for ocrmypdf to check in reasonable locations for Tesseract and GS, examining the registry or whatever, so PATH becomes an override.
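
A rough sketch of that first step, leaving out the registry lookup: PATH is consulted first so that it acts as the override, and a list of fallback directories is searched only if that fails. find_program is a hypothetical helper name, and the fallback directory shown is the install location mentioned earlier in this thread:

```python
import os
import shutil


def find_program(name, fallback_dirs=()):
    """Locate an executable: PATH first (the override), then fallback folders."""
    found = shutil.which(name)
    if found:
        return found
    search_path = os.pathsep.join(str(d) for d in fallback_dirs)
    if not search_path:
        return None
    return shutil.which(name, path=search_path)


# Fallback folder taken from the PATH instructions earlier in this thread.
print(find_program('tesseract', [r'C:\Program Files\Tesseract-OCR']))
```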

I don't believe I can bundle the GS installer unless I change OCRmyPDF to AGPL, and I'm not sure I want to do that. I believe everything else could be bundled.

As far as actually doing a Windows installer, bundling, or setting up a choco package, I am hoping the community will step up, because I haven't made a Windows installer before or tried to package a Python application for Windows, and other people probably know how to get this off the ground faster than I can, even if I end up finishing it. I converted to Azure Pipelines for its better Windows support, so that ideally we can test and deploy for every distribution type in one shot.

ocrmypdf is a unique/more complex case in its use of Leptonica (ABI level binding to a C library) and relies on calls to third party non-Python binaries. It will probably be necessary to spin off Leptonica into a separate package that gets compiled as a binary wheel, something I've already started work on actually. That means installer-generator programs that try to inspect the source code for dependencies are probably going to fail, because they usually look for Python-only dependencies.

santiago-afonso commented 4 years ago

I don't know if this helps, as I'm not knowledgeable enough, but I can't get it to run using the exact instructions currently in the documentation. Btw, thank you all, especially the maintainer, for the hard work.

The paths for tesseract and gs have been added. First I got the libcurl-4 is missing error (plus 3 other dlls). Then I installed libcurl from chocolatey and manually installed qpdf to the folder that the first comment specified (https://github.com/jbarlow83/OCRmyPDF/issues/455#issue-522103851). The current situation can be seen next; I don't know where to get pikepdf from. All are 64-bit versions. I'm running Windows 10 1909.

C:\WINDOWS\system32>choco list --local-only
Chocolatey v0.10.15
chocolatey 0.10.15
chocolatey-core.extension 1.3.5.1
curl 7.67.0
Ghostscript 9.50
Ghostscript.app 9.50
pngquant 2.12.3
python3 3.8.1
tesseract 5.0.0.20191030-alpha
8 packages installed.

C:\WINDOWS\system32>ocrmypdf
Traceback (most recent call last):
  File "c:\python38\lib\site-packages\pikepdf\__init__.py", line 10, in <module>
    from . import _qpdf
ImportError: DLL load failed while importing _qpdf: The specified module could not be found.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\python38\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python38\Scripts\ocrmypdf.exe\__main__.py", line 5, in <module>
  File "c:\python38\lib\site-packages\ocrmypdf\__init__.py", line 18, in <module>
    from . import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "c:\python38\lib\site-packages\ocrmypdf\pdfa.py", line 38, in <module>
    import pikepdf
  File "c:\python38\lib\site-packages\pikepdf\__init__.py", line 12, in <module>
    raise ImportError("pikepdf's extension library failed to import")
ImportError: pikepdf's extension library failed to import

santiago-afonso commented 4 years ago

Also, while I can successfully install 9.2.0 on Ubuntu 18.04 under WSL, when trying to access it from the command line (https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-windows-subsystem-for-linux) I get the old version for some reason:

C:\WINDOWS\system32>wsl sudo ln -s  /home/user/.local/bin/ocrmypdf /usr/local/bin/ocrmypdf
[sudo] password for USERNAME:

C:\WINDOWS\system32>wsl ocrmypdf --version
6.1.2

jbarlow83 commented 4 years ago

@osnofas The trouble is likely that I've been working with a Windows 10 image with a lot of developer things on it already so it's not the best test environment.

Could you run this command and send the results? This should just print a list of files installed for pikepdf:

dir /s c:\python38\lib\site-packages\pikepdf

Also what version of pip is installed? (pip --version and python -m pip --version)

Regarding WSL, you'll need to ensure that /home/user/.local/bin is added to the WSL system PATH environment variable.

santiago-afonso commented 4 years ago

On Ubuntu/WSL:

:/mnt/c/WINDOWS/system32$ pip --version
pip 19.3.1 from /home/USERNAME/.local/lib/python3.6/site-packages/pip (python 3.6)
:/mnt/c/WINDOWS/system32$ python -m pip --version
pip 9.0.1 from /usr/lib/python2.7/dist-packages (python 2.7)

On Windows proper:

C:\Python38\Lib\site-packages>dir
 Volume in drive C is Windows
 Volume Serial Number is B23D-AC41

 Directory of C:\Python38\Lib\site-packages

23/12/2019  20:09    <DIR>          .
23/12/2019  20:09    <DIR>          ..
23/12/2019  20:09               126 easy_install.py
23/12/2019  20:09    <DIR>          pip
23/12/2019  20:09    <DIR>          pip-19.2.3.dist-info
23/12/2019  20:09    <DIR>          pkg_resources
18/12/2019  23:26               121 README.txt
23/12/2019  20:09    <DIR>          setuptools
23/12/2019  20:09    <DIR>          setuptools-41.2.0.dist-info
23/12/2019  20:09    <DIR>          __pycache__
               2 File(s)            247 bytes
               8 Dir(s)  101.128.069.120 bytes free

C:\Python38\Lib\site-packages>

I installed python with chocolatey. Regardless, I have another python distro on Windows for use with anaconda and it doesn't have pikepdf either. (It appears that overall I have at least 4 python installations between Windows and WSL/Ubuntu.)
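
With several interpreters installed, it helps to ask each one directly which executable it is and whether it can see a given module. A minimal sketch to run under each Python in turn; describe_module is a hypothetical helper, and pikepdf is used as the example because it is the module that fails to import above:

```python
import importlib.util
import sys


def describe_module(module_name):
    """Report which interpreter is running and where it would import the module from."""
    spec = importlib.util.find_spec(module_name)
    return {
        'interpreter': sys.executable,
        'module': module_name,
        'location': spec.origin if spec else None,  # None: not importable here
    }


print(describe_module('pikepdf'))
```

A location of None under the interpreter that the ocrmypdf command uses would mean the package was installed into a different Python.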