openhatch / oh-bugimporters

Bug importers for the OpenHatch project oh-mainline
https://oh-bugimporters.readthedocs.org/
GNU Affero General Public License v3.0

GitHub download code needs to be nicer to github.com's servers #110

Closed paulproteus closed 9 years ago

paulproteus commented 9 years ago

At the time of writing, oh-bugimporters has difficulty downloading all the bugs it wants to from github.com.

@ehashman discovered that GitHub throttles API requests after 5000 per hour.

paulproteus commented 9 years ago

#109 is a pull request to address this.

See https://github.com/openhatch/oh-mainline/issues/1515 for user reports + serious debugging work.

ehashman commented 9 years ago

#109 should resolve this. We'll see after tonight's scrape!

thomasballinger commented 9 years ago

This doesn't seem addressed. About half of the GitHub requests are failing, in a pattern that suggests rate limiting of 60 requests per hour, as per https://developer.github.com/v3/#rate-limiting:

For unauthenticated requests, the rate limit allows you to make up to 60 requests per hour.

See the relevant bits of the logs for evidence. The logs also suggest that a 1 second delay isn't going to be nearly enough: if delay is the approach taken here, it needs to be one minute so requests are kept to 60/hour.
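To make the arithmetic concrete, here is a minimal sketch of the delay needed to stay under an hourly quota. The function name is made up for illustration and is not part of oh-bugimporters; the 5000 per hour figure is the one quoted earlier in this thread.

```python
def min_delay_seconds(requests_per_hour):
    """Smallest fixed delay between requests that stays within an hourly quota.

    Hypothetical helper for illustration; not part of oh-bugimporters.
    """
    return 3600.0 / requests_per_hour


# 60 requests/hour (unauthenticated) works out to one request per minute.
unauthenticated_delay = min_delay_seconds(60)

# 5000 requests/hour (the limit mentioned earlier in this thread) allows
# a delay of well under a second.
higher_limit_delay = min_delay_seconds(5000)
```

At 60 requests per hour the delay is 60 seconds, so a crawl of any real size would take a long time unless the requests are authenticated.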

thomasballinger commented 9 years ago

I'm having trouble running this code locally, so can't suggest code with confidence.

The simplest fix looks to me like not using the Scrapy basic auth middleware (though that feels like the "right" solution) and instead doing one of two things at https://github.com/openhatch/oh-bugimporters/blob/master/bugimporters/github.py#L31: either append &client_id=xxxx&client_secret=yyyy (an OAuth client ID and secret) to the end of the URLs before constructing the request, OR pass a manually constructed authentication header (scrapy.http.Request takes a headers argument). The header could be built by hand, as described here, or with something in the stdlib, though everything I've found there seems roundabout.
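A rough sketch of both options, using only the Python 3 standard library. The helper names are hypothetical and this has not been run against oh-bugimporters; it only shows how the URL or the header could be built before the request is constructed.

```python
import base64
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_client_credentials(url, client_id, client_secret):
    """Option 1: append client_id and client_secret to the URL's query string.

    Hypothetical helper; existing query parameters are preserved.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [("client_id", client_id), ("client_secret", client_secret)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def basic_auth_header(username, password):
    """Option 2: build an HTTP Basic Authorization header by hand.

    Hypothetical helper; the result is a plain dict that could be passed
    as the headers argument of scrapy.http.Request.
    """
    raw = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "Basic " + token}
```

With the second option, the request would presumably be built along the lines of `scrapy.http.Request(url=url, headers=basic_auth_header(user, password), ...)`, leaving the rest of the existing arguments unchanged.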

The OAuth client ID and secret in the query parameters approach requires creating an OAuth application (several clicks on GitHub). The basic auth approach requires creating a GitHub user whose username and password can be used in the script. Either set of credentials should be stored in environment variables or something similar so they're not hardcoded.
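For the "not hardcoded" part, something like the following could work. The environment variable names GITHUB_USERNAME and GITHUB_PASSWORD are made up for this sketch, as is the function name; nothing in oh-bugimporters is known to read them.

```python
import os


def load_github_credentials(environ=os.environ):
    """Read GitHub credentials from the environment.

    GITHUB_USERNAME and GITHUB_PASSWORD are hypothetical variable names.
    Returns a (username, password) tuple, or None if either is missing,
    so the caller can fall back to unauthenticated requests.
    """
    username = environ.get("GITHUB_USERNAME")
    password = environ.get("GITHUB_PASSWORD")
    if not username or not password:
        return None
    return username, password
```

Returning None rather than raising keeps a local run without credentials working, just at the lower unauthenticated rate limit.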

paulproteus commented 9 years ago

I think I can make this work with GitHub basic auth plus a personal access token. I'm looking into that now.

Thanks again to @thomasballinger for doing the research on this.

paulproteus commented 9 years ago

FWIW, I am running a fresh crawl right now, and we'll see if my changes to oh-bugimporters fix this. If so, I'll submit a pull request.

ehashman commented 9 years ago

@paulproteus can this be closed?

paulproteus commented 9 years ago

+1