Open Bouke opened 3 years ago
I'm also slightly lost. As of today, the tutorial example reads
$ python tools/pdf2txt.py example.pdf
all the text from the pdf appears on the command line
which is not at all what happens, or should happen, I think. This suggests that we are using the current environment's python executable to run a script in a 'tools' folder, but what directory do they assume the user is in? I'm in my project folder, so the pdf2txt.py file is actually under venv/bin/, which explains why
$ pdf2txt.py sample.pdf
works for both Bouke and myself. No tools path is necessary, and there is no need to call it with python. Or am I missing something?
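For anyone who wants to check where the script ended up on their own machine, here is a minimal sketch using only the standard library. It assumes pip installed pdf2txt.py into the active environment's scripts directory, as it did for me; `find_script` is a hypothetical helper name, not part of pdfminer.

```python
import shutil
import sysconfig
from pathlib import Path


def find_script(name="pdf2txt.py"):
    """Return the path of an installed script, or None if it is not found."""
    # First look on PATH, which is what the shell does when you type the name.
    on_path = shutil.which(name)
    if on_path:
        return Path(on_path)
    # Otherwise check the scripts directory of the running interpreter
    # (venv/bin on Linux/macOS, venv\Scripts on Windows).
    candidate = Path(sysconfig.get_path("scripts")) / name
    return candidate if candidate.exists() else None


if __name__ == "__main__":
    location = find_script()
    print(location if location else "pdf2txt.py not found in this environment")
```

If this prints a path inside venv/bin, the script is already on PATH while the venv is active, which would explain why the bare command works without a tools/ prefix.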
The documentation is also lacking the correct import statements, so the examples are not usable out of the box. For example, what is the intended way to access extract_text()? Should we be using the PDFParser() class? I guess I am off to read through the full API then...
Ok, so most of my answers lie in the tutorial pages that follow, but I wasn't naturally inclined to move on to "part 2" or "Extract elements" before getting part 1 to work. All that the documentation needs is the change I mentioned above to the command line example, and the following line on https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html:
>>> from pdfminer.high_level import extract_text
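With that import in place, the high-level example could be made copy-pasteable along these lines. This is only a sketch, assuming pdfminer.six is installed in the active environment; `pdf_to_text` is a hypothetical wrapper name of my own, and any file name passed to it is a placeholder.

```python
def pdf_to_text(path):
    """Return all text from the PDF at `path` as a single string."""
    # Imported inside the function so this file still loads when
    # pdfminer.six is missing; the ImportError then surfaces on the call.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

Calling `pdf_to_text("example.pdf")` from the folder that contains the file should return the extracted text, which can then be printed or written to disk.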
So what is the correct way of setting this up?
I agree. The documentation is pretty bad.
Bug report
The "How to use" section says to run the script like this:
python pdf2txt.py ...
However, after installing it through pip, that doesn't work. Are the usage instructions correct and is there something wrong with my installation, or should the instructions be changed?