lebedov / python-pdfbox

Python interface to Apache PDFBox command-line tools.

Windows PDFBox.PDFBox() fails at Urllib error #24

Open Rammurthy5 opened 4 years ago

Rammurthy5 commented 4 years ago

When I merely import pdfbox and instantiate PDFBox(), it immediately throws the following error. Please help.

urllib.error.URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>

lebedov commented 4 years ago

Looks like a network connectivity issue. Is your computer connected to the Internet through a corporate firewall? If you start a python3 session on your Windows box, does the following code run without raising an exception?

import urllib.request
r = urllib.request.urlopen('https://archive.apache.org/dist/pdfbox/')
data = r.read()

Rammurthy5 commented 4 years ago

No, it doesn't. It throws the same error. How do I fix it? Can I add a proxy address to it?

lebedov commented 4 years ago

You can try setting the environment variable http_proxy or https_proxy (depending on the protocol) to the URI of your proxy before importing pdfbox.
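
If setting the variable in the Windows shell is inconvenient, it can also be set from within Python before pdfbox is imported. This is a minimal sketch, assuming a hypothetical proxy at proxy.example.com:8080; replace it with the address your network administrator provides. It relies only on the standard library, which reads these variables when building its default opener.

```python
import os
import urllib.request

# Hypothetical proxy address (an assumption for illustration only);
# substitute the real URI of your corporate proxy.
PROXY = "http://proxy.example.com:8080"

# Set both variables so that HTTP and HTTPS requests are routed via the proxy.
os.environ["http_proxy"] = PROXY
os.environ["https_proxy"] = PROXY

# urllib consults the environment when it looks up proxies;
# getproxies() shows what it will actually use.
proxies = urllib.request.getproxies()
print(proxies)

# Only after this point: import pdfbox
```

If the printed mapping contains your proxy under the "http" and "https" keys, urllib requests made later in the same process should go through it.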

Another possibility is to set the user agent to that of a common web browser, as some firewalls block HTTP requests that do not appear to come from the latter; try the following code and see whether it throws an error:


import urllib.request
req = urllib.request.Request(
    url='https://archive.apache.org/dist/pdfbox/',
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'
    }
)
r = urllib.request.urlopen(req)
data = r.read()
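
To make the outcome of that check easier to report back, the same request can be wrapped so that it returns the failure reason instead of a traceback. This is only a sketch: the helper names build_request and check_connectivity and the 10-second timeout are assumptions for illustration, not part of pdfbox.

```python
import urllib.error
import urllib.request

URL = "https://archive.apache.org/dist/pdfbox/"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) "
    "Gecko/20100101 Firefox/79.0"
)


def build_request(url=URL, user_agent=USER_AGENT):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def check_connectivity(url=URL, timeout=10):
    """Hypothetical helper: return (True, HTTP status) or (False, reason)."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as r:
            return True, r.status
    except urllib.error.URLError as exc:
        # Covers refused or timed-out connections as well as HTTP error codes.
        return False, str(exc.reason)
    except OSError as exc:
        # Covers lower-level socket errors, such as a timeout while reading.
        return False, str(exc)
```

Calling check_connectivity() returns a tuple such as (True, 200) on success, or (False, <reason>) if the request is still blocked.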