lebedov / python-pdfbox

Python interface to Apache PDFBox command-line tools.

Missing required option: '--input=<infile>' when using extract_text #26

Closed VictorZuanazzi closed 3 years ago

VictorZuanazzi commented 3 years ago

I am getting a new error with pdfbox. It was working fine until today at 10:00 (Amsterdam time).

This is how I call it:

import pdfbox

# tmp_path: presumably pytest's tmp_path fixture (a pathlib.Path)
pdf_box = pdfbox.PDFBox()
temp = tmp_path / "hello.txt"
pdf_box.extract_text('tests/testfiles/dummy.pdf', temp.as_posix())

And pdfbox returns a CLI error:

tests/unit/test_pdfbox.py::test_pdf_extractor Missing required option: '--input=<infile>'
Extracts the text from a PDF document
Usage: extracttext [-hV] [-alwaysNext] [-console] [-debug] [-html]
                   [-ignoreBeads] [-rotationMagic] [-sort] [-password
                   [=<password>]] [-encoding=<encoding>] [-endPage=<endPage>]
                   -i=<infile> [-o=<outfile>] [-startPage=<startPage>]
      -alwaysNext            Process next page (if applicable) despite
                               IOException (ignored when -html)
      -console               Send text to console instead of file
      -debug                 Enables debug output about the time consumption of
                               every stage
      -encoding=<encoding>   UTF-8 or ISO-8859-1, UTF-16BE, UTF-16LE, etc.
                               (default: UTF-8)
      -endPage=<endPage>     The last page to extract (1 based, inclusive)
  -h, --help                 Show this help message and exit.
      -html                  Output in HTML format instead of raw text
  -i, --input=<infile>       the PDF file
      -ignoreBeads           Disables the separation by beads
  -o, --output=<outfile>     the exported text file
      -password[=<password>] the password for the PDF or certificate in
                               keystore.
      -rotationMagic         Analyze each page for rotated/skewed text, rotate
                               to 0° and extract separately (slower, and
                               ignored when -html)
      -sort                  Sort the text before writing of every stage
      -startPage=<startPage> The first page to start extraction (1 based)
  -V, --version              Print version information and exit.
make: *** [Makefile:167: unit-tests] Error 2

Has anyone had a similar issue? How can I solve it?

VictorZuanazzi commented 3 years ago

In case someone else has a similar issue, we found a workaround:

TL;DR: download version 2.0.23 with `wget -r --level 1 https://archive.apache.org/dist/pdfbox/2.0.23/` and set `os.environ['PDFBOX'] = './archive.apache.org/dist/pdfbox/2.0.23/pdfbox-app-2.0.23.jar'`

Long version:

python-pdfbox defaults to the latest PDFBox Java app if none is given. However, PDFBox had a major release that broke the command-line interface python-pdfbox uses. The releases can be found here: https://archive.apache.org/dist/pdfbox/?C=M;O=D

To fix the issue, it is necessary to point python-pdfbox to the version that was working. To do so we have to download it:

mkdir pdfbox
cd pdfbox
wget -r --level 1 https://archive.apache.org/dist/pdfbox/2.0.23/ 
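
If only the application jar is needed, fetching that single file may be enough instead of mirroring the whole directory. A minimal Python sketch, assuming the jar sits at `<version>/pdfbox-app-<version>.jar` under the archive URL (as the path used in this workaround suggests). The helper names `pdfbox_jar_url` and `download_pdfbox_jar` are made up for illustration and are not part of python-pdfbox:

```python
import pathlib
import urllib.request

# Assumption: archive layout is <root>/<version>/pdfbox-app-<version>.jar
ARCHIVE_ROOT = "https://archive.apache.org/dist/pdfbox"


def pdfbox_jar_url(version):
    """Build the archive URL of the pdfbox-app jar for a given version."""
    return f"{ARCHIVE_ROOT}/{version}/pdfbox-app-{version}.jar"


def download_pdfbox_jar(version, dest_dir="pdfbox"):
    """Download the jar into dest_dir (hypothetical helper) and return its path."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"pdfbox-app-{version}.jar"
    if not target.exists():  # skip the download if the jar is already there
        urllib.request.urlretrieve(pdfbox_jar_url(version), str(target))
    return target
```

Note that this puts the jar directly in `pdfbox/`, so the path to use for the PDFBOX variable would then be `./pdfbox/pdfbox-app-2.0.23.jar` rather than the mirrored path produced by `wget -r`.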

Then you have to set the environment variable PDFBOX. A hacky way of doing that is to add these lines at the top of the first Python file to be executed:

import os
os.environ['PDFBOX'] = './pdfbox/archive.apache.org/dist/pdfbox/2.0.23/pdfbox-app-2.0.23.jar'

Hope that helps until the library is updated to work with PDFBox 3.0.
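
The environment-variable step can be made a little less fragile by checking that the jar exists before pinning it. A small sketch, assuming python-pdfbox picks up the PDFBOX variable as described in this comment; `pin_pdfbox_jar` is a hypothetical helper, not something the library provides:

```python
import os
import pathlib


def pin_pdfbox_jar(jar_path):
    """Point python-pdfbox at a specific jar via the PDFBOX environment variable.

    Hypothetical helper: fails early if the jar is missing instead of
    leaving a dangling path in the environment.
    """
    jar = pathlib.Path(jar_path).resolve()
    if not jar.is_file():
        raise FileNotFoundError(f"PDFBox jar not found: {jar}")
    os.environ["PDFBOX"] = str(jar)
    return jar
```

To be safe, call it before pdfbox is imported or `pdfbox.PDFBox()` is created, for example `pin_pdfbox_jar('./pdfbox/archive.apache.org/dist/pdfbox/2.0.23/pdfbox-app-2.0.23.jar')`.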

lebedov commented 3 years ago

The command-line options of the PDFBox app have changed in version 3.0, so python-pdfbox needs to be updated to handle the new interface. As a temporary fix, revert the pdfbox-app-*.jar file downloaded by python-pdfbox to an earlier version. You can find the path to the jar file using:

import pdfbox
p = pdfbox.PDFBox()
print(p.pdfbox_path)

lebedov commented 3 years ago

Uploaded an updated version that only downloads PDFBox 2.*.
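
For anyone still on an older python-pdfbox, a rough way to tell whether the jar in use is a 2.x or 3.x release is to parse its file name, for instance the one printed by `p.pdfbox_path`. A sketch, assuming the jar follows the `pdfbox-app-<version>.jar` naming seen in this thread; `jar_major_version` and `is_supported_jar` are hypothetical helpers and do not reflect how python-pdfbox itself selects the download:

```python
import pathlib
import re

# Assumption: file names look like pdfbox-app-2.0.23.jar
_JAR_RE = re.compile(r"pdfbox-app-(\d+)\..*\.jar$")


def jar_major_version(jar_path):
    """Return the major version number encoded in a pdfbox-app jar file name."""
    name = pathlib.Path(jar_path).name
    match = _JAR_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognizable pdfbox-app jar name: {name}")
    return int(match.group(1))


def is_supported_jar(jar_path):
    """True for 2.x jars, whose CLI options this thread reports as working."""
    return jar_major_version(jar_path) == 2
```

With this, `is_supported_jar(str(p.pdfbox_path))` gives a quick signal that the 3.0 jar was picked up and should be replaced.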