lebedov / python-pdfbox

Python interface to Apache PDFBox command-line tools.

Unable to pass path to the PDF file on google cloud #31

Closed mudasserch1 closed 2 years ago

mudasserch1 commented 2 years ago

Hey. I am trying to read data from the cloud: around 800k resumes. Is there any way to pass a Google Cloud path to the extract_text function?

My error is: java.io.FileNotFoundException: java.io.FileNotFoundException

This means the path was not found, but when I open the same link in a browser it shows a valid resume.

lebedov commented 2 years ago

Not directly - the issue is that the Java pdfbox package wrapped by python-pdfbox doesn't know how to deal with files that need to be downloaded over a network.

One possibility you might want to explore is to download the data (e.g., with python requests or some other library) to a temporary file and pass that to python-pdfbox.
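A minimal sketch of that approach, assuming the PDF is reachable through a URL that needs no authentication. It uses the standard library's `urllib.request` in place of requests so it runs without extra dependencies. The helper names `download_to_temp_pdf` and `extract_text_from_url` are hypothetical, and the sketch relies on `extract_text` writing its output next to the input file with a `.txt` extension, as the python-pdfbox README describes; the UTF-8 output encoding is also an assumption.

```python
import os
import shutil
import tempfile
import urllib.request


def download_to_temp_pdf(url, suffix=".pdf"):
    """Download `url` to a temporary file and return its local path."""
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    try:
        # Stream the response to disk so large files are not held in memory.
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, out)
    except Exception:
        # Do not leave a partial file behind if the download fails.
        os.remove(local_path)
        raise
    return local_path


def extract_text_from_url(url):
    """Hypothetical helper: download a PDF, run python-pdfbox on it, return the text."""
    import pdfbox  # imported here so the download helper works without it

    local_pdf = download_to_temp_pdf(url)
    # Assumption: with no output path given, the text lands in <name>.txt.
    local_txt = os.path.splitext(local_pdf)[0] + ".txt"
    try:
        pdfbox.PDFBox().extract_text(local_pdf)
        # Assumption: the output is UTF-8; undecodable bytes are replaced.
        with open(local_txt, encoding="utf-8", errors="replace") as f:
            return f.read()
    finally:
        # Remove both temporary files whether or not extraction succeeded.
        for path in (local_pdf, local_txt):
            if os.path.exists(path):
                os.remove(path)
```

For 800k files, deleting each temporary file right after extraction keeps disk usage flat. Links that only work in a logged-in browser session would need authenticated requests instead of a plain `urlopen` call.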

mudasserch1 commented 2 years ago

Thanks @lebedov