ogre0403 / ipgod


Resources that fail to download do not have their failure information written to ckan_download #13

Open ogre0403 opened 7 years ago

ogre0403 commented 7 years ago

e.g. http://data.gov.tw/api/v1/rest/dataset/313000000G-000011

The resource download URL inside it is http://dmz2.moea.gov.tw/aaweb/opendata/opendata.zip

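When the download fails completely because no connection can be made, there is no HTTP response to read, so nothing is available to store in the status column of the ckan_download table. 1482996837245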
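One possible way to close this gap is to write a row even when no HTTP response exists. The sketch below is only an illustration, not the project's code: it assumes a sentinel value (`NO_RESPONSE = -1`), a hypothetical `detail` column, and an in-memory `sqlite3` table standing in for the real `ckan_download` table, whose actual schema and database engine are not shown in this issue. `fetch` is a hypothetical callable that returns an HTTP status code or raises.

```python
import sqlite3

# Hypothetical sentinel: no HTTP status exists when the connection never opens.
NO_RESPONSE = -1


def make_db():
    """Create an in-memory stand-in for ckan_download (columns are assumed)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE ckan_download (resource_id TEXT, status INTEGER, detail TEXT)"
    )
    return db


def record_download(db, resource_id, fetch):
    """Run fetch() and store either its HTTP status or a failure marker."""
    try:
        status, detail = fetch(), None
    except Exception as exc:  # e.g. requests.exceptions.ConnectionError
        # Keep the exception class name so the failure can be diagnosed later.
        status, detail = NO_RESPONSE, type(exc).__name__
    db.execute(
        "INSERT INTO ckan_download (resource_id, status, detail) VALUES (?, ?, ?)",
        (resource_id, status, detail),
    )
    db.commit()
    return status
```

With this shape, a failed download still leaves a row behind, so it can be told apart from a resource that was never attempted.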

ogre0403 commented 7 years ago

The log for the connection error is as follows:

2016-12-29 17:54:36,646 [ERROR] Downloader.py_27 : Download 315860400M-000003-001 ERROR!!!
Traceback (most recent call last):
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connection.py", line 142, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\util\connection.py", line 75, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\socket.py", line 732, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 595, in urlopen
    chunked=chunked)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 363, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\http\client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\http\client.py", line 1128, in _send_request
    self.endheaders(body)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\http\client.py", line 1079, in endheaders
    self._send_output(message_body)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\http\client.py", line 911, in _send_output
    self.send(msg)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\http\client.py", line 854, in send
    self.connect()
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connection.py", line 167, in connect
    conn = self._new_conn()
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connection.py", line 151, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x000001A31625FB38>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\adapters.py", line 423, in send
    timeout=timeout
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 640, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\util\retry.py", line 287, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='manager.twport.com.tw', port=80): Max retries exceeded with url: /Upload/E/FileDownload/11901/635721206005253384.csv (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x000001A31625FB38>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\ProjectSource\PycharmProjects\ipgod\crawler\src\Downloader.py", line 25, in run
    self.download_flag = item.download()
  File "D:\ProjectSource\PycharmProjects\ipgod\crawler\src\metadata.py", line 87, in download
    response = requests.get(URL,stream=True,verify=False,headers={'Connection':'close'})
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\sessions.py", line 596, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\1403035\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='manager.twport.com.tw', port=80): Max retries exceeded with url: /Upload/E/FileDownload/11901/635721206005253384.csv (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x000001A31625FB38>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))

PeterHKC commented 7 years ago

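About these two records:

1. ID: 313000000G-000011
downloadURL: http://dmz2.moea.gov.tw/aaweb/opendata/opendata.zip
This URL is invalid because the request times out, even though DNS does resolve the hostname. I found that changing it to http://www.moea.gov.tw/aaweb/opendata/opendata.zip gives a URL that the server answers, but the file has already been removed (404).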

2. ID: 315860400M-000003
downloadURL: http://manager.twport.com.tw/Upload/E/FileDownload/11901/635721206005253384.csv
This URL is also invalid, because DNS cannot resolve its hostname. I found that changing it to http://www.twport.com.tw/Upload/E/FileDownload/11901/635721206005253384.csv gives a working URL.

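My idea is to first call socket.gethostbyname(URL) to see whether DNS can resolve the host, and only if it can, go on to check for a timeout. I have two questions:
1. I am not sure where the timeout should be checked (Socket or Request)?
2. What error code should be used when the DNS lookup fails? (504?)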
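The DNS-first idea above could be sketched roughly as follows, using only the standard library. This is an illustration under assumptions, not the project's implementation: the function name `probe`, the four string labels, and the 10-second default timeout are all hypothetical. `socket.gethostbyname` expects a host name, not a full URL, so the sketch extracts the host with `urllib.parse.urlparse` first. The `resolve` and `connect` parameters exist only so the logic can be exercised without network access.

```python
import socket
from urllib.parse import urlparse

# Hypothetical labels; the issue leaves the actual status value undecided.
DNS_FAILURE = "dns_failure"
CONNECT_TIMEOUT = "connect_timeout"
CONNECT_ERROR = "connect_error"
REACHABLE = "reachable"


def probe(url, timeout=10.0, resolve=socket.gethostbyname,
          connect=socket.create_connection):
    """Classify why a download URL cannot be reached, before any HTTP request."""
    parts = urlparse(url)
    host = parts.hostname  # host name only, e.g. "manager.twport.com.tw"
    if host is None:
        return DNS_FAILURE
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        resolve(host)  # step 1: can DNS find the host at all?
    except socket.gaierror:
        return DNS_FAILURE
    try:
        conn = connect((host, port), timeout)  # step 2: does a TCP connect finish in time?
    except socket.timeout:
        return CONNECT_TIMEOUT
    except OSError:
        return CONNECT_ERROR
    conn.close()
    return REACHABLE
```

On the two questions: `requests.get` also accepts a `timeout` argument and raises `requests.exceptions.Timeout` when it is exceeded, so the timeout can be detected at the Request level without a separate socket probe. As for the code, a failed DNS lookup produces no HTTP status at all (504 is a status that a gateway or proxy sends), so whatever value is stored for this case would be a local convention rather than a real response code.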