voussoir / timesearch

The subreddit archiver
BSD 3-Clause "New" or "Revised" License

Recovering from exception #2

Closed Brotakuu closed 6 years ago

Brotakuu commented 6 years ago

After running timesearch through a huge sub, the bot exited with an exception.

Is there a way to resume progress from where it exited? Running timesearch again only grabs the most recent threads from the top (it does not attempt to continue where it left off).

Also: any idea what might be the cause? (I'm running two instances with different apps configured, on macOS.)

Jul 14 2015 13:28:52 - Jul 14 2015 12:37:17 +100
Jul 14 2015 12:36:47 - Jul 14 2015 11:16:45 +100
Jul 14 2015 11:16:19 - Jul 14 2015 09:50:38 +100
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 1009, in recv_into
    return self.read(nbytes, buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 871, in read
    return self._sslobj.read(len, buffer)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/util/retry.py", line 357, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 389, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='oauth.reddit.com', port=443): Read timed out. (read timeout=16.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/requestor.py", line 47, in request
    return self._http.request(*args, timeout=TIMEOUT, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/adapters.py", line 521, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='oauth.reddit.com', port=443): Read timed out. (read timeout=16.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "timesearch.py", line 11, in <module>
    status_code = timesearch.main(sys.argv[1:])
  File "/Users/1/ts/timesearch/__init__.py", line 425, in main
    args.func(args)
  File "/Users/1/ts/timesearch/__init__.py", line 329, in timesearch_gateway
    timesearch.timesearch_argparse(args)
  File "/Users/1/ts/timesearch/timesearch.py", line 152, in timesearch_argparse
    interval=common.int_none(args.interval),
  File "/Users/1/ts/timesearch/timesearch.py", line 78, in timesearch
    for chunk in submissions:
  File "/Users/1/ts/timesearch/common.py", line 66, in generator_chunker
    for item in generator:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/models/reddit/subreddit.py", line 451, in submissions
    sort='new', syntax='cloudsearch'):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/models/listing/generator.py", line 52, in __next__
    self._next_batch()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/models/listing/generator.py", line 62, in _next_batch
    self._listing = self._reddit.get(self.url, params=self.params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/reddit.py", line 367, in get
    data = self.request('GET', path, params=params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/praw/reddit.py", line 472, in request
    params=params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/sessions.py", line 181, in request
    params=params, url=url)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/sessions.py", line 112, in _request_with_retries
    data, files, json, method, params, retries, url)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/sessions.py", line 97, in _make_request
    params=params)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/rate_limit.py", line 33, in call
    response = request_function(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prawcore/requestor.py", line 49, in request
    raise RequestException(exc, args, kwargs)
prawcore.exceptions.RequestException: error with request HTTPSConnectionPool(host='oauth.reddit.com', port=443): Read timed out. (read timeout=16.0)
voussoir commented 6 years ago

Sorry for the inconvenience; this is another artifact of the PRAW3-to-PRAW4 transition. Posts used to be collected oldest-first, which is why a second run only checks for posts newer than the ones it already has. Now they're collected newest-first, and that isn't as good.

You can still provide the --upper and --lower arguments on the command line. For example, lower should be the timestamp at which the subreddit was created, which you can find in the subreddit's JSON:

https://www.reddit.com/r/askreddit/about.json (search for "created_utc")
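
If you'd rather script that lookup, here is a minimal sketch (the function name and User-Agent string are illustrative, not part of timesearch):

```python
# Minimal sketch: fetch a subreddit's creation timestamp from its about.json.
# Reddit rejects requests without a descriptive User-Agent, so we set one.
import requests

def subreddit_created_utc(subreddit):
    url = 'https://www.reddit.com/r/{}/about.json'.format(subreddit)
    response = requests.get(url, headers={'User-Agent': 'created_utc lookup example'})
    response.raise_for_status()
    return int(response.json()['data']['created_utc'])

print(subreddit_created_utc('askreddit'))
```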

For upper, you can use the timestamp from just before it crashed. July 14 2015 is approximately 1436897779, but you may want to set it a bit higher to account for possible timezone offsets.
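
If you want to compute that bound yourself instead of trusting my arithmetic, a quick sketch (the one-day padding is an arbitrary safety margin):

```python
# Illustrative: derive an upper bound from the last date printed before the
# crash, padded by one day to absorb any timezone offset.
from datetime import datetime, timedelta, timezone

last_seen = datetime(2015, 7, 14, tzinfo=timezone.utc)
upper = int((last_seen + timedelta(days=1)).timestamp())
print(upper)  # 1436918400
```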

TLDR:

> timesearch timesearch -r askreddit --upper 1436984179 --lower 1201146735
voussoir commented 6 years ago

The cause is just a read timeout, which means the website was probably too busy. Timestamp searching is fairly expensive, which is probably one of the reasons they're killing it and the new platform doesn't support it. It's not your fault.
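
If timeouts keep interrupting long runs, one workaround is to wrap the whole run in a retry loop that catches the RequestException shown in the traceback above. A rough sketch, not part of timesearch; the function name, attempt count, and delay are all arbitrary:

```python
# Illustrative workaround: re-run the archiving function when reddit times out.
import time
import prawcore

def run_with_retries(func, attempts=5, delay=30):
    for attempt in range(attempts):
        try:
            return func()
        except prawcore.exceptions.RequestException:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)  # let reddit recover before retrying

# e.g. run_with_retries(lambda: timesearch.main(['timesearch', '-r', 'askreddit']))
```

Combined with explicit --upper and --lower bounds, each retry searches the same window rather than starting over from the newest posts.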