jgehrcke / github-repo-stats

GitHub Action for advanced repository traffic analysis and reporting
Apache License 2.0

retry upon transient error during paginated stargazer/fork retrieval #82

Open jgehrcke opened 12 months ago

jgehrcke commented 12 months ago

A known limitation since we started using PyGithub for fetching data: when iterating through pages via e.g. for count, fork in enumerate(repo.get_forks(), 1), an individual HTTP request is not retried upon transient error, and it is not easy to cleanly retry the HTTP request corresponding to one specific page (out of many pages) from the calling program.

Example of a boring transient error affecting one of many HTTP requests and taking down the entire action run:

...
231004-23:07:04.767 INFO:MainThread: 8000 forks fetched
231004-23:07:11.723 INFO:MainThread: 8200 forks fetched
231004-23:07:18.807 INFO:MainThread: 8400 forks fetched
...
Traceback (most recent call last):
  File "//fetch.py", line 596, in <module>
    main()
  File "//fetch.py", line 111, in main
    fetch_and_write_fork_ts(repo, args.fork_ts_outpath)
  File "//fetch.py", line 225, in fetch_and_write_fork_ts
    dfforkcsv = get_forks_over_time(repo)
  File "//fetch.py", line 434, in get_forks_over_time
    for count, fork in enumerate(repo.get_forks(), 1):
  File "/usr/local/lib/python3.10/site-packages/github/PaginatedList.py", line 56, in __iter__
    newElements = self._grow()
  File "/usr/local/lib/python3.10/site-packages/github/PaginatedList.py", line 67, in _grow
    newElements = self._fetchNextPage()
  File "/usr/local/lib/python3.10/site-packages/github/PaginatedList.py", line 199, in _fetchNextPage
    headers, data = self.__requester.requestJsonAndCheck(
  File "/usr/local/lib/python3.10/site-packages/github/Requester.py", line 354, in requestJsonAndCheck
    *self.requestJson(
  File "/usr/local/lib/python3.10/site-packages/github/Requester.py", line 454, in requestJson
    return self.__requestEncode(cnx, verb, url, parameters, headers, input, encode)
  File "/usr/local/lib/python3.10/site-packages/github/Requester.py", line 528, in __requestEncode
    status, responseHeaders, output = self.__requestRaw(
  File "/usr/local/lib/python3.10/site-packages/github/Requester.py", line 555, in __requestRaw
    response = cnx.getresponse()
  File "/usr/local/lib/python3.10/site-packages/github/Requester.py", line 127, in getresponse
    r = verb(
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
...

Retrying this naively at the higher level would mean fetching all forks again. Of course, this is Python, and all kinds of workarounds are possible — but they would take more time to build and test.
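For completeness, the naive higher-level approach could be sketched as a small retry helper wrapping the whole fetch (a sketch with hypothetical names; `retry_transient` is not part of this project or PyGithub). It re-runs the entire callable on a transient error with exponential backoff, which for pagination means re-fetching every page:

```python
import time

# In the real program the tuple would include requests.exceptions.ConnectionError;
# the built-in ConnectionError keeps this sketch dependency-free.
TRANSIENT_EXCEPTIONS = (ConnectionError,)


def retry_transient(func, attempts=4, backoff_s=0.5):
    """Call func(); on a transient error, sleep and retry with exponential backoff.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TRANSIENT_EXCEPTIONS as exc:
            if attempt == attempts:
                raise
            wait = backoff_s * 2 ** (attempt - 1)
            print(f"transient error ({exc}); retry {attempt}/{attempts - 1} in {wait:.1f} s")
            time.sleep(wait)
```

Usage would look like `forks = retry_transient(lambda: list(repo.get_forks()))` — correct, but wasteful for a repository with thousands of forks, since every retry pays for all pages again.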