Closed paulproteus closed 9 years ago
See https://github.com/openhatch/oh-mainline/issues/1515 for user reports + serious debugging work.
This doesn't seem addressed. About half of the GitHub requests are failing, in a pattern that suggests rate limiting of 60 requests per hour, as per https://developer.github.com/v3/#rate-limiting:
For unauthenticated requests, the rate limit allows you to make up to 60 requests per hour.
See the relevant bits of the logs for evidence. The logs also suggest that a 1-second delay isn't going to be nearly enough: if delay is the approach taken here, it needs to be one minute between requests so they are kept to 60/hour.
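To make the arithmetic behind that one-minute figure explicit, here is a minimal sketch. The function name is hypothetical and not part of oh-bugimporters; it only divides the seconds in an hour by the hourly request budget.

```python
SECONDS_PER_HOUR = 3600.0


def min_delay_seconds(requests_per_hour):
    """Smallest evenly spaced delay that stays within an hourly request budget.

    Hypothetical helper for illustration; not taken from oh-bugimporters.
    """
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return SECONDS_PER_HOUR / requests_per_hour


# Unauthenticated limit quoted above: 60 requests/hour -> one request per minute.
unauthenticated_delay = min_delay_seconds(60)
```

If a delay really is the chosen approach, a value like this could presumably be fed into Scrapy's DOWNLOAD_DELAY setting, though I haven't checked how oh-bugimporters configures its crawler.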
I'm having trouble running this code locally, so can't suggest code with confidence.
The simplest fix looks to me like not using the Scrapy basic auth middleware (though that feels like the "right" solution) but instead either adding &client_id=xxxx&client_secret=yyyy to the end of the URLs, with a client key and secret, before constructing the request at https://github.com/openhatch/oh-bugimporters/blob/master/bugimporters/github.py#L31, OR constructing a header at this same spot (scrapy.http.Request takes a headers argument), either with a manually constructed authentication header, as described here, or using something in the stdlib to build the header, though everything I've found there seems roundabout.
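The two options could be sketched roughly as follows, using only the standard library (Python 3). Both function names are hypothetical, the credentials are placeholders, and this is not tested against the actual code in bugimporters/github.py.

```python
import base64
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_client_credentials(url, client_id, client_secret):
    """Option 1: append client_id/client_secret to the URL's query string.

    Hypothetical helper; any existing query parameters are preserved.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [("client_id", client_id), ("client_secret", client_secret)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def basic_auth_header(username, password):
    """Option 2: build an HTTP Basic Authorization header by hand.

    Hypothetical helper; returns a dict suitable for a headers argument.
    """
    raw = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "Basic " + token}
```

The dict from basic_auth_header could then be passed as the headers argument when the scrapy.http.Request is constructed. Note that GitHub has since stopped accepting credentials in query parameters, so the header route is likely the only one of the two that still works.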
The approach of putting the OAuth client ID and secret in the query parameters requires creating an OAuth application (several clicks on GitHub). The basic auth solution requires creating a GitHub user whose username and password can be used in the script. Either way, the credentials should be stored in environment variables or something similar so they're not hardcoded.
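As a minimal sketch of keeping the credentials out of the code: the variable names GITHUB_USERNAME and GITHUB_PASSWORD are assumptions chosen for illustration, not names oh-bugimporters actually reads.

```python
import os


def load_github_credentials(environ=None):
    """Return (username, password) from the environment, or None if unset.

    GITHUB_USERNAME and GITHUB_PASSWORD are hypothetical variable names.
    Returning None lets the caller fall back to unauthenticated requests.
    """
    if environ is None:
        environ = os.environ
    username = environ.get("GITHUB_USERNAME")
    password = environ.get("GITHUB_PASSWORD")
    if not username or not password:
        return None
    return username, password
```

Accepting the mapping as a parameter makes this easy to exercise without touching the real environment, e.g. load_github_credentials({"GITHUB_USERNAME": "bot", "GITHUB_PASSWORD": "s3cret"}).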
I think I can make this work with GitHub basic auth plus a personal access token. I'm looking into that now.
Thanks again to @thomasballinger for doing the research on this.
FWIW, I am running a fresh crawl right now, and we'll see if my changes to oh-bugimporters fix this. If so, I'll submit a pull request.
@paulproteus can this be closed?
+1
At the time of writing, oh-bugimporters has difficulty downloading all the bugs it wants to from github.com.
@ehashman discovered that GitHub throttles API requests after 5000 per hour.