jalan / pdftotext

Simple PDF text extraction
MIT License

Cannot install on Windows #16

Open geauxtigers opened 6 years ago

geauxtigers commented 6 years ago

I am running Win10 with the Anaconda distribution of Python 3.6 and have the MS build tools and compiler installed. When I pip install the pdftotext package, the installation begins and then terminates with this message:

pdftotext.cpp(3): fatal error C1083: Cannot open include file: 'poppler/cpp/poppler-document.h': No such file or directory

Any ideas?

jalan commented 6 years ago

You would need to get poppler and its development files installed on Windows. I don't use Windows, so I am not much help, sorry.

If you figure something out, I will gladly add it to the README here!

geauxtigers commented 6 years ago

Thanks. I am researching this but have not found any good guidance. I will report back my findings.

randiaz95 commented 6 years ago

Any updates on this problem? I happen to have this same thing.

geauxtigers commented 6 years ago

No, I didn’t. My solution ended up being in a completely different direction, using different packages. This may be worth a mention in the README file for future Windows (10) users.

rcy222 commented 6 years ago

All hope is not lost on the windows version. There is a command line utility with the same name and you can use the subprocess package to execute pdftotext

PDFtotext windows download instruction, credit @s2t2

  1. Go to https://www.xpdfreader.com/download.html and click "Download the Xpdf tools"
  2. Uncompress/extract the zip file, and move the folder to a location like the Desktop or the Programs directory.
  3. Inside the unzipped folder, copy the file bin64/pdftotext.exe into your project repository
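
A minimal sketch of the subprocess approach described above. It assumes pdftotext.exe from the Xpdf tools sits in the working directory or on the PATH; the function names and the default executable path are hypothetical, chosen only for illustration. The "-" argument asks the tool to write the text to stdout instead of a file.

```python
import subprocess


def build_command(pdf_path, exe="pdftotext.exe"):
    """Build the argument list for the pdftotext command-line tool.

    `exe` is an assumed location; point it at wherever pdftotext.exe was copied.
    """
    # -layout keeps the physical layout, -enc selects the output encoding,
    # and a text-file argument of "-" sends the text to stdout.
    return [str(exe), "-layout", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path, exe="pdftotext.exe"):
    """Run pdftotext on one PDF and return the extracted text as a string."""
    result = subprocess.run(
        build_command(pdf_path, exe),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the tool reports a failure
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("report.pdf")` would then return the text of a (hypothetical) report.pdf, or raise an exception if the executable is missing or the conversion fails.
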
davidolmo commented 6 years ago

3. Inside the unzipped folder, copy the file bin64/pdftotext.exe into your project repository

You mean that it can be used from windows command prompt but not python?

GuodongQi commented 5 years ago
3. Inside the unzipped folder, copy the file bin64/pdftotext.exe into your project repository

You mean that it can be used from windows command prompt but not python?

Just copy the bin64/*.exe files to your Python PATH directory; then they can be used from both cmd and the Python shell.
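
One way to check that the copied executable is actually visible, assuming it is meant to be found through the PATH environment variable. The helper name `find_tool` is hypothetical.

```python
import shutil


def find_tool(name="pdftotext"):
    """Return the full path of `name` if it can be found on PATH, else None."""
    # On Windows, shutil.which also tries the extensions listed in PATHEXT,
    # so "pdftotext" matches pdftotext.exe.
    return shutil.which(name)


if __name__ == "__main__":
    location = find_tool()
    print(location if location else "pdftotext was not found on PATH")
```

If this prints a path, both cmd and a subprocess call from Python should be able to start the tool by name.
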

Mattwmaster58 commented 5 years ago

Any chance of prebuilt binaries being offered? Is it something that could be integrated into the CI setup? I think we'd need to build windows binaries on windows though, so moving to Appveyor would be required, unfortunately. Maybe a solution like cibuildwheel can help with that. I tried to, and failed miserably, at building it on windows.

Considering that the only alternative at the moment is pdfminer and its derivatives, which are super slow (4 orders of magnitude slower, in my experience) and produce inaccurate results that are in some cases impossible to parse accurately, I think it'd be great to offer prebuilt binaries with this functionality.

woodsjs commented 5 years ago

I "ugly" installed pdftotext successfully on windows three times over the past two days, as the subprocess method is a non-starter for me. I have a writeup on SO

https://stackoverflow.com/questions/45912641/unable-to-install-pdftotext-on-python-3-6-missing-poppler/58139729#58139729

as well as on my blog, which has screenshots

https://coder.haus/2019/09/27/installing-pdftotext-through-pip-on-windows-10/

Please try this, let me know if it works. I'm hoping to take time to do it properly and potentially generate a PR.

My solution requires Anaconda (for conda install). First, install Microsoft VC++ build tools, download poppler for windows as well as conda install poppler, and copy some of the poppler files to different locations in the Anaconda directory structure. Again, I have done this a total of 3 times and know it can be done better, but this will get you up and running.

woodsjs commented 5 years ago

The following fixes the issue on Windows 10.

Assumes MS VC++ Build Tools is installed. Assumes Anaconda is being used. Assumes Poppler is installed using conda install poppler.

The code update is in setup.py -

import platform
from os import path, getenv

# On Windows, point the build at the Poppler headers and libraries that
# `conda install poppler` places under the active conda environment.
if platform.system() in ['Windows']:
    conda_dir = getenv('CONDA_PREFIX')
    anaconda_poppler_include_dir = path.join(conda_dir, 'Library\include')
    anaconda_poppler_library_dir = path.join(conda_dir, 'Library\lib')
    include_dirs = [anaconda_poppler_include_dir]
    library_dirs = [anaconda_poppler_library_dir]

pip install completes successfully and the unit tests run successfully.

Let me know if this looks sane, and if I should create a PR for this.

zacps commented 5 years ago

I needed to apply this diff, then it worked for me:

diff --git a/setup.py b/setup.py
index 4c0e861..8a7337a 100644
--- a/setup.py
+++ b/setup.py
@@ -29,8 +29,8 @@ elif platform.system() in ["Windows"]:
         print("ERROR: CONDA_PREFIX is not found.")
         print("       Install Anaconda or fix missing CONDA_PREFIX and try again.")
         sys.exit(1)
-    anaconda_poppler_include_dir = path.join(conda_dir, "Library\include")
-    anaconda_poppler_library_dir = path.join(conda_dir, "Library\lib")
+    anaconda_poppler_include_dir = path.join(conda_dir, r"Library\include")
+    anaconda_poppler_library_dir = path.join(conda_dir, r"Library\lib")
     include_dirs = [anaconda_poppler_include_dir]
     library_dirs = [anaconda_poppler_library_dir]
 else:
jalan commented 5 years ago

@zacps I don't understand what your diff changes. Does it just change the strings into raw strings, or am I missing something? There are no escape characters, so aren't both versions equal?

$ python
>>> r"Library\include" == "Library\include"
True
zacps commented 5 years ago

\i and \l are invalid escape sequences, but now that I think about it I'm not sure why it changed anything.
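
A small sketch of what is going on here: Python keeps an unrecognized escape such as \i in the string unchanged, so the plain and raw literals compare equal, but compiling the plain literal emits a warning (DeprecationWarning since Python 3.6, SyntaxWarning from 3.12). The snippet compiles the literal at runtime so that the warning can be captured; the variable names are illustrative only.

```python
import warnings

# Source text containing the non-raw literal: value = "Library\include"
source = 'value = "Library\\include"'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)

plain = namespace["value"]
raw = r"Library\include"

print(plain == raw)  # the two strings hold the same characters
print([w.category.__name__ for w in caught])  # warning(s) raised while compiling
```

So the raw-string change should not alter the resulting paths; it only silences the warning.
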

Kagigz commented 4 years ago

Thank you so much @woodsjs for your blog post!

Just a quick update I would add: When running pip install it's looking at the files inside my python3.6 libs folder and not the conda libs folder, so for me it worked when I copied the poppler-cpp.lib file into the AppData/Local/Programs/Python/Python36/libs folder.

woodsjs commented 4 years ago

Thanks for this, @Kagigz! I've updated the post to point to the possibility of the file living in the Python/Python{PythonVersion}/libs directory.

jalan commented 4 years ago

@woodsjs I finally set up appveyor and pushed a release including your changes from #47: https://pypi.org/project/pdftotext/2.1.3/ . I hope it's nicer for Windows/conda users now. Thanks again!

GadgetSteve commented 4 years ago

@jalan It looks like you have yet to upload the Windows binaries to pypi so Windows users can get them!

jalan commented 4 years ago

@GadgetSteve ?

I didn't say I was planning to upload any Windows binaries

GadgetSteve commented 4 years ago

" I finally set up appveyor and pushed a release including your changes from #47: https://pypi.org/project/pdftotext/2.1.3/ . I hope it's nicer for Windows/conda users now."

If you are not planning to upload any Windows binaries then the improvement for windows users is minimal - most do not have a compiler installed.

palakjadwani commented 3 years ago

I have had the hardest time installing pdftotext on my Windows 10 machine. I found @woodsjs's page very useful, as those were the exact errors I got. I finally got it installed successfully, but PyCharm does not recognize it and gives: ImportError: DLL load failed while importing pdftotext: The specified module could not be found.

I see inside site-packages that pip install doesn't download the jar file, maybe that is the problem. Please please someone help me out. Spent 2 days on this thing. @jalan

GadgetSteve commented 3 years ago

@palakjadwani this is the problem with there not being a binary download for the pdftotext package. You can find some workarounds at http://faculty.washington.edu/jwilker/559/2018/pdftotext.pdf but they are less than ideal.

woodsjs commented 3 years ago

@palakjadwani If you can see the package using the anaconda prompt (you can, correct?), this more than likely has NOTHING to do with the install of pdftotext. This is most likely because your pycharm install is not using the interpreter, conda, that you installed pdftotext under.

See the JetBrains documentation on changing your interpreter ( https://www.jetbrains.com/help/pycharm/configuring-python-interpreter.html?_ga=2.75189201.2125986240.1606050657-56391306.1606050657#interpreter ). You would add your existing Anaconda environment.

I've tested this, as I do not use PyCharm, and was able to add my anaconda environment and get access to pdftotext. I would also suggest shutting pycharm down and restarting to make the new interpreter active.
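
A quick way to compare the two environments, as a sketch: run the snippet below once from the Anaconda prompt and once from inside PyCharm. If the interpreter paths differ, PyCharm is using a different environment. The function name `describe_environment` is hypothetical; `find_spec` only locates the module on disk without importing it, so it works even when the import itself fails with a DLL error.

```python
import importlib.util
import sys


def describe_environment(module_name="pdftotext"):
    """Report which interpreter is running and where it would load the module from."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,  # the interpreter actually in use
        "prefix": sys.prefix,          # root of the active environment
        "module_location": spec.origin if spec else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

A `module_location` of None means that interpreter cannot see pdftotext at all.
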

GodMakesMe commented 3 years ago

I have the wheel file of pdftotext for cp38 (Python 3.8.5, 64-bit). Just go to PowerShell and run:

cd [Location of the file]
pip install ./[Wheel file name]

Or:

py -3.8 -m pip install ./[Wheel file name]

I am also attaching poppler files so you can extract these files in python destination folder. pdftotext.zip

If any one have the wheel of pdftotext of version cp39 64 bit kindly share it.

GodMakesMe commented 3 years ago

Now I have the latest wheel file, for cp39 64-bit: pdftotext.zip

polyspastos commented 3 years ago

Hey!

I have come across the following while trying to install it:

(venv) (dir)>python -m pip install ./pdftotext-2.1.5-cp39-cp39-win_amd64.whl
Processing (dir)\pdftotext-2.1.5-cp39-cp39-win_amd64.whl
Installing collected packages: pdftotext
Successfully installed pdftotext-2.1.5

(venv) (dir)>python (script).py
Traceback (most recent call last):
  File "(dir)\(script).py", line 11, in <module>
    import pdftotext
ImportError: DLL load failed while importing pdftotext: The specified module could not be found.

(venv) (dir)>pip freeze
(misc modules)
pdftotext @ file:///(drive):/(dir)/pdftotext-2.1.5-cp39-cp39-win_amd64.whl
(misc modules)

Do you have any idea what could be causing this?

Any help would be greatly appreciated.

GodMakesMe commented 3 years ago

There is possibly an error in the environment. Try running it in Anaconda, or install it from the Anaconda prompt. If the problem persists, you can use PyPDF2 as the PDF extractor. Otherwise, there could be issues with the Poppler installation. I am also looking for a solution. You can also refer to the Coder Haus blog post on this. If you find a solution, please let me know.

GadgetSteve commented 3 years ago

I suspect that you need to have the poppler bin directory on your path for it to work. This is probably why the pdftotext.zip that had the wheel in also had poppler-0.68.0_x86 (1).7z in it. You will need to use 7-zip to extract this and add the_location_that_you_extract_to\bin to your path.

Hope that helps.
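
A hedged sketch of doing the same from inside a script. On Python 3.8 and later, Windows no longer consults PATH when resolving the DLLs that an extension module depends on, so adding the folder to PATH may not be enough on its own there; os.add_dll_directory is the documented way to add a search directory. The function name and the example path are hypothetical, and the fallback branch covers older Pythons where add_dll_directory does not exist.

```python
import os
from pathlib import Path


def register_poppler_dlls(poppler_bin):
    """Make the DLLs in `poppler_bin` findable before `import pdftotext` runs."""
    bin_dir = Path(poppler_bin).resolve()
    if not bin_dir.is_dir():
        raise FileNotFoundError(f"Poppler bin directory not found: {bin_dir}")
    if hasattr(os, "add_dll_directory"):
        # Windows with Python 3.8+: register the directory explicitly.
        return os.add_dll_directory(str(bin_dir))
    # Older Pythons (or other platforms): fall back to prepending to PATH.
    os.environ["PATH"] = str(bin_dir) + os.pathsep + os.environ.get("PATH", "")
    return None
```

Usage would look like `register_poppler_dlls(r"C:\tools\poppler\bin")` (an assumed location) followed by `import pdftotext`.
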

GodMakesMe commented 3 years ago

You can try the latest version of Poppler and extract the files into the corresponding destination folders, and then try installing pdftotext. If you are using Anaconda, you can run conda install -c conda-forge poppler.

poppler-20.12.1-h31d4e15_3.zip

ReMiOS commented 3 years ago

I've had some trouble getting pdftotext working on Windows, but I managed with the following steps:

  1. Download Poppler: https://anaconda.org/conda-forge/poppler/21.03.0/download/win-64/poppler-21.03.0-h9ff6ed8_0.tar.bz2
  2. Copy the contents of ..\poppler-21.03.0-h9ff6ed8_0\Library\lib\ to ..\\libs\
  3. Copy the contents of ..\poppler-21.03.0-h9ff6ed8_0\Library\include\poppler to ..\\include\poppler
  4. Copy the DLLs from ..\poppler-21.03.0-h9ff6ed8_0\Library\include\bin*.dll to ..\\Lib\site-packages\

The DLLs to copy to ..\\Lib\site-packages are: charset.dll, freetype.dll, iconv.dll, libcrypto-1_1-x64.dll, libcurl.dll, liblzma.dll, libpng16.dll, libssh2.dll, openjp2.dll, tiff.dll, zlib.dll, zstd.dll

Now you can install pdftotext with:

pip install pdftotext-2.1.6-cp39-cp39-win_amd64.whl

Files are in attachment Conda_Forge_DLL_x64.zip
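
The DLL copy step above could be scripted roughly like this. It is only a sketch: the function name `copy_dlls` is hypothetical, and both directories have to be supplied by the caller, since the exact source and destination folders depend on where Poppler was extracted and where Python is installed.

```python
import shutil
from pathlib import Path


def copy_dlls(source_dir, target_dir):
    """Copy every *.dll file from source_dir into target_dir; return the copied names."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for dll in sorted(source.glob("*.dll")):
        shutil.copy2(dll, target / dll.name)  # copy2 also preserves timestamps
        copied.append(dll.name)
    return copied
```

Calling it with the extracted Poppler DLL folder as `source_dir` and the site-packages folder as `target_dir` returns the list of copied file names, which can be compared against the DLL list above.
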

P.S. It would be great if the Poppler PDF rendering library were upgraded from the xpdf-3.0 code base it is based on to the xpdf-4.03 code base.

TheQuinbox commented 3 years ago

Would needing to install Poppler also affect users of the software, if they're running a compiled application that uses this package?

jalan commented 3 years ago

Would needing to install Poppler also affect users of the software, if they're running a compiled application that uses this package?

That depends on how you compile and distribute your application, of course. For example, on linux, you could bundle any needed shared libs with your app and start the app with an appropriate LD_LIBRARY_PATH set. I'm not a Windows developer, but I imagine there is something similar there.

ReMiOS commented 3 years ago

Using pyinstaller on Windows the Poppler DLLs are packed in the executable.

With Poppler v21.10, the lcms2 DLL (the lcms color engine) is also needed.

Link to Poppler: https://anaconda.org/conda-forge/poppler/21.10.0/download/win-64/poppler-21.10.0-h24fffdf_0.tar.bz2

Updated wheel: pdftotext-2.2.1-cp39-cp39-win_amd64.whl.zip

Updated DLL package: Conda_Forge_DLL_x64.zip

poppler.dll v21.10.0
poppler-glib.dll v21.10.0
poppler-cpp.dll v21.10.0
freetype.dll v2.10
zlib.dll v1.2.11
libssh2.dll v1.10.0
cairo.dll v1.16.0
libtiff.dll, tiff.dll v4.3.0
libzstd.dll and zstd.dll v1.5.0
libcurl v7.79.1.0
openjp2.dll v2.4.0
iconv.dll and charset.dll v1.16
libpng16.dll v1.6.37
liblzma.dll v5.2.2
libcrypto-1_1-x64.dll v1.1.1l
lcms2.dll v2.12