pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Incorrect installation/usage instructions? #660

Open Bouke opened 3 years ago

Bouke commented 3 years ago

Bug report

The "How to use" section says to run the script like this: python pdf2txt.py .... However, after installing it through pip, that doesn't work:

$ python pdf2txt.py
/usr/local/Cellar/python@3.9/3.9.6/libexec/bin/python: can't open file '[CWD]/pdf2txt.py': [Errno 2] No such file or directory
$ pdf2txt
zsh: command not found: pdf2txt
$ pdf2txt.py
usage: pdf2txt.py [-h] [--version] [--debug] [--disable-caching]

Are the usage instructions correct and is there something wrong with my installation, or should the instructions be changed?

d-tork commented 3 years ago

I'm also slightly lost. As of today, the tutorial example reads

$ python tools/pdf2txt.py example.pdf
all the text from the pdf appears on the command line

which is not at all what happens, or what should happen, I think. This suggests that we are using the current environment's python executable to run a script in the 'tools' folder, but what directory do they assume the user is in? I'm in my project folder, and the pdf2txt.py file is actually under venv/bin/, which explains why

$ pdf2txt.py sample.pdf

works for both Bouke and myself. No tools path necessary, and don't need to call it with python. Or am I missing something?
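For anyone who wants to check where pip actually put the script, here is a minimal sketch using only the standard library. It assumes the script keeps the name pdf2txt.py, as in the commands above; the helper name find_pdf2txt is made up for illustration and is not part of pdfminer.six.

```python
import shutil
import sysconfig
from pathlib import Path


def find_pdf2txt():
    """Return the path of an installed pdf2txt.py script, or None if not found.

    Hypothetical helper for illustration; not part of pdfminer.six.
    """
    # First look on PATH, which is what the shell does for `pdf2txt.py ...`.
    on_path = shutil.which("pdf2txt.py")
    if on_path:
        return Path(on_path)

    # Fall back to the scripts directory of the running interpreter
    # (e.g. venv/bin on Linux/macOS, venv\Scripts on Windows).
    candidate = Path(sysconfig.get_path("scripts")) / "pdf2txt.py"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    print(find_pdf2txt())
```

If this prints a path inside the virtualenv's bin directory, that would be consistent with the script being runnable directly by name while `python pdf2txt.py` fails from the project folder.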

The documentation is also lacking the correct import statements, so the examples are not usable out of the box. Like what is the intended way to access extract_text()? Should we be using the PDFParser() class? I guess I am off to read through the full API then...

d-tork commented 3 years ago

Ok, so most of my answers lie in the tutorial pages that follow, but I wasn't naturally inclined to move on to "part 2" or "Extract elements" before getting part 1 to work. All that the documentation needs is the change I mentioned above to the command line example, and the following line on https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html:

>>> from pdfminer.high_level import extract_text
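With that import in place, a minimal sketch of the high-level usage could look like the following. It assumes pdfminer.six is installed and that a file named sample.pdf exists in the working directory; the wrapper name pdf_to_text is hypothetical and only there to give a clearer error when the file is missing.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract all text from a PDF using pdfminer.six's high-level API.

    Hypothetical wrapper for illustration; only extract_text comes from
    pdfminer.six itself.
    """
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear message instead of a deeper traceback.
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so the file check above works even without the package.
    from pdfminer.high_level import extract_text

    return extract_text(str(path))


if __name__ == "__main__":
    # sample.pdf is an assumed file name; replace it with your own document.
    print(pdf_to_text("sample.pdf"))
```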

amine-aboufirass commented 2 years ago

So what is the correct way of setting this up?

pietermarsman commented 2 years ago

I agree. The documentation is pretty bad.

phil294 commented 2 years ago

check https://github.com/pdfminer/pdfminer.six/issues/691

crystalthoughts commented 6 months ago

#691 isn't correct either