lebedov / python-pdfbox

Python interface to Apache PDFBox command-line tools.

extract_text goes on forever. #23

Open Rammurthy5 opened 4 years ago

Rammurthy5 commented 4 years ago

I installed the latest PDFBox package on my Mac via pip, imported it, and called the extract_text() method. The call never returns, even for a 196 KB file. Please help.

>>> import pdfbox as p, os
>>> os.path.exists(f)  # f is the file path
True
>>> pp = p.PDFBox()
>>> pp.extract_text(f)

The extract_text(f) call never returns.
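
One way to confirm the hang without blocking the interactive session is to run the call in a child interpreter with a timeout. This is a minimal sketch, not part of python-pdfbox: the helper names `run_python_with_timeout` and `check_extract_text` are hypothetical, and the only library call it assumes is the `pdfbox.PDFBox().extract_text(path)` call shown above.

```python
import subprocess
import sys


def run_python_with_timeout(code, timeout_s):
    """Run `code` in a fresh interpreter.

    Returns (finished, stdout, stderr); finished is False if the
    child was still running after timeout_s seconds and was killed.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "", ""
    return True, proc.stdout, proc.stderr


def check_extract_text(pdf_path, timeout_s=60):
    """Hypothetical helper: try the extract_text call from this report
    in a child process and report whether it finished in time."""
    code = "import pdfbox; pdfbox.PDFBox().extract_text(%r)" % pdf_path
    return run_python_with_timeout(code, timeout_s)


# Example (pdf_path is whatever file reproduces the problem):
# finished, out, err = check_extract_text(f, timeout_s=60)
# print("finished" if finished else "timed out", err)
```

If the call finishes, `err` holds any traceback or Java error output from the child, which is useful to paste into the report. The timeout only kills the child Python interpreter; if the library starts a separate Java process, that process may need to be terminated by hand.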

lebedov commented 4 years ago

What versions of Python, Java, and macOS are you running? Can you attach the file you are trying to process? As noted in #14, I haven't been able to reproduce the problem.
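
The requested version details can be gathered in one step. The following is a rough sketch using only the standard library; the function name `collect_versions` is hypothetical, and it assumes that the `java` found on `PATH` is the one the package ends up using, which may not be true on every setup.

```python
import platform
import subprocess


def collect_versions():
    """Return a dict with the Python, macOS, and Java versions."""
    info = {
        "python": platform.python_version(),
        # mac_ver() returns an empty release string on non-macOS systems.
        "macos": platform.mac_ver()[0] or "n/a (not macOS)",
    }
    try:
        proc = subprocess.run(
            ["java", "-version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
        # `java -version` writes its banner to stderr, not stdout.
        info["java"] = (proc.stderr or proc.stdout).strip()
    except (OSError, subprocess.TimeoutExpired):
        info["java"] = "java not found on PATH or did not respond"
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(name + ":", value)
```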

Rammurthy5 commented 4 years ago

macOS: 10.15.6
Python: 3.7.1
Java: 1.8.0_202
File attached: pdf copy.pdf

lebedov commented 4 years ago

I didn't encounter any errors with the file you posted using the package versions in #14. Can you try using OpenJDK 14 rather than Oracle's Java?