openaddresses / pyesridump

Scrapes an ESRI MapServer REST endpoint to spit out more generally-usable geodata.
MIT License

Request 838 of 885 timed out, would you like to [A]bort or [S]kip or [R]etry and continue? #54

Open ArieRudich opened 6 years ago

ArieRudich commented 6 years ago

Let me start by giving BIG thanks for a most useful tool! THANKS!

The attached traceback is from a timeout that occurred after a LONG all-night dump (request 838 of 885, using the resultOffset method).

Might be a good idea to catch this and turn it into a user prompt, something like:

> Request 838 of 885 timed out, would you like to [A]bort or [S]kip or [R]etry and continue?

preferably with a default TimeOutRetry=3 (or a more general FailRetry) and a flag argument to override it.
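The retry behaviour suggested above could be sketched as a small wrapper around whatever performs the HTTP call. This is only an illustration, not part of pyesridump's API; the helper name and defaults are hypothetical:

```python
import time

def call_with_retry(fn, retries=3, backoff=0, exceptions=(Exception,)):
    """Call fn(), retrying up to `retries` extra times on the given
    exception types, sleeping `backoff` seconds between attempts."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return fn()
        except exceptions as exc:
            last_exc = exc
            if attempt < retries:
                time.sleep(backoff)
    # All attempts failed; re-raise the last error for the caller.
    raise last_exc
```

In the dumper this would wrap the `requests.request(...)` call and catch `requests.exceptions.Timeout`, with `retries` supplied by a new command-line flag.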

Another helpful aid in this and similar situations (I just had a "similar" situation with "socket.gaierror: [Errno 11002] getaddrinfo failed") would be to expose --resultOffset, so an aborted download could be restarted at the offset last reported by -v or, even better, at the offset reported by the exception handler.

What do you think?

Thanks!

Traceback (most recent call last):
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\urllib3\connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\urllib3\connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "c:\python37\Lib\http\client.py", line 1321, in getresponse
    response.begin()
  File "c:\python37\Lib\http\client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "c:\python37\Lib\http\client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "c:\python37\Lib\socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\urllib3\util\retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\urllib3\packages\six.py", line 686, in reraise
    raise value
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\urllib3\connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\urllib3\connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='***XXX***', port=80): Read timed out. (read timeout=30)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\esridump\dumper.py", line 418, in __iter__
    response = self._request('POST', query_url, headers=headers, data=query_args)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\esridump\dumper.py", line 43, in _request
    return requests.request(method, url, timeout=self._http_timeout, **kwargs)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\requests\adapters.py", line 526, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='www***XXX***', port=80): Read timed out. (read timeout=30)

During handling of the above exception, another exception occurred:

2018-07-09 05:22:02,786 - cli.esridump - DEBUG - POST http://www.***XXX***/MapServer/18/query, args {'resultOffset': 838000, 'resultRecordCount': 1000, 'where': '1=1', 'geometryPrecision': 7, 'returnGeometry': True, 'outSR': '4326', 'outFields': '*', 'f': 'json'}

Traceback (most recent call last):
  File "c:\python37\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\python37\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\xampp\htdocs\Data\Snippets\esridump\Scripts\esri2geojson.exe\__main__.py", line 9, in <module>
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\esridump\cli.py", line 111, in main
    feature = next(feature_iter)
  File "c:\xampp\htdocs\data\snippets\esridump\lib\site-packages\esridump\dumper.py", line 425, in __iter__
    raise EsriDownloadError("Could not connect to URL", e)
esridump.errors.EsriDownloadError: ('Could not connect to URL', ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='www.***XXX***', port=80): Read timed out. (read timeout=30)")))
iandees commented 6 years ago

Thanks for the suggestion! This might be a little tricky to pull off because of the way I built the command line tool on top of the library, but I'll think through it some this week.

ArieRudich commented 6 years ago

Many thanks for considering it so positively! I'd guess it may be easier to have a first cut that works on the 1000-record chunks, so a repeat, resume, or restart would operate on a whole 1000-record chunk regardless of where inside the chunk the actual failure occurred. That still leaves a bit of manual work to filter out the overlap, but it would be a huge improvement. It would work best with --jsonlines, and could be improved by having the 'catch' code report the offset where the actual abort occurred.
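Filtering the overlap after a whole-chunk restart could look something like this. It's a sketch, not part of the tool; it assumes the dump was written with --jsonlines and that identical features serialize to identical JSON:

```python
import json

def merge_jsonl_parts(*parts):
    """Yield features from successive --jsonlines dump parts,
    skipping duplicates introduced by restarting on a chunk boundary."""
    seen = set()
    for lines in parts:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            # Use canonical JSON text as the dedup key.
            key = json.dumps(json.loads(line), sort_keys=True)
            if key not in seen:
                seen.add(key)
                yield json.loads(line)
```

In practice the parts would be open file handles over the original dump and the restarted dump.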

andrewharvey commented 6 years ago

:+1: to automated retry. I commonly encounter errors like the one below, which resolve upon simply trying again (I've hacked in a way to pass in the offset and total feature count so it can pick up where it left off):

Traceback (most recent call last):
  File "/usr/local/bin/esri2geojson", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/esridump/cli.py", line 116, in main
    feature = next(feature_iter)
  File "/usr/local/lib/python2.7/dist-packages/esridump/dumper.py", line 425, in __iter__
    raise EsriDownloadError("Could not connect to URL", e)
esridump.errors.EsriDownloadError: ('Could not connect to URL', EsriDownloadError("http://maps.six.nsw.gov.au/arcgis/rest/services/public/NSW_Property/MapServer/4/query: Could not retrieve this chunk of objects HTTP 504 <html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n",))
andrewharvey commented 5 years ago

@ArieRudich Although this ticket is about being able to restart when timeouts occur, for your original timeout issue: if the layer has an ID field, I've found that forcing esri2geojson to query by ID range avoids timeouts. You can now do this with --paginate-oid, from https://github.com/openaddresses/pyesridump/commit/a4c68db6025264e323bb73b9d901be6050ae6f7f. Could you try that and see if it makes a difference?
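For reference, the idea behind paginating by object ID is to replace one long resultOffset scan with many small WHERE-clause queries the server can answer quickly. A rough sketch of how such clauses could be built (the field name `OBJECTID` and the page size are assumptions, not pyesridump internals):

```python
def oid_where_clauses(min_oid, max_oid, page_size=1000, oid_field="OBJECTID"):
    """Build WHERE clauses that walk a layer in fixed-size ID ranges."""
    clauses = []
    lo = min_oid
    while lo <= max_oid:
        hi = min(lo + page_size - 1, max_oid)
        clauses.append(f"{oid_field} >= {lo} AND {oid_field} <= {hi}")
        lo = hi + 1
    return clauses
```

Each clause would then be sent as the `where` parameter of a separate query, so a failure only loses one small range rather than the whole offset scan.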

jayarehart commented 5 years ago

@andrewharvey I have a similar problem to the one described above. I tried --paginate-oid, but had no luck stopping the timeouts. I ended up doing it manually, but that required closing the JSON brackets each time the file timed out.