openhatch / oh-bugimporters

Bug importers for the OpenHatch project oh-mainline
https://oh-bugimporters.readthedocs.org/
GNU Affero General Public License v3.0

oh-bugimporters should do per-domain backoff #81

Open ghost opened 10 years ago

ghost commented 10 years ago

Comment by paulproteus:

Some bug trackers (openhatch.org/bugs/ especially...) report HTTP 504 Gateway Timeout if you request more than 1-2 bugs per second.

Scrapy currently handles this in the retry middleware (http://doc.scrapy.org/en/0.12/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.retry), which re-queues the job but doesn't insist on a time delay.

It'd be nice to have a custom RetryMiddleware that did per-domain backoff. (Note that we're sort of abusing the Scrapy architecture; we're supposed to have one "spider" class per domain, but instead we only have one.)

One way to do this is to provide a custom subclass of scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and then override the _retry method.

That should let us more reliably crawl some of the sites that are quite finicky.
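
As a rough illustration of the per-domain bookkeeping such a middleware would need, here is a minimal sketch in plain Python. It is not taken from oh-bugimporters or Scrapy: the class name `DomainBackoff`, its methods, and the exponential-doubling policy are all assumptions made for this example, and hooking it into a `RetryMiddleware` subclass's `_retry` method is not shown.

```python
import time
from urllib.parse import urlparse


class DomainBackoff:
    """Hypothetical helper: tracks failures per domain and computes a wait time."""

    def __init__(self, base_delay=1.0, max_delay=60.0, clock=time.monotonic):
        self.base_delay = base_delay  # delay after the first failure, in seconds
        self.max_delay = max_delay    # upper bound on any single delay
        self.clock = clock            # injectable so the logic can be tested
        self._failures = {}           # domain -> consecutive failure count
        self._not_before = {}         # domain -> earliest time of next request

    @staticmethod
    def domain_of(url):
        return urlparse(url).netloc.lower()

    def record_failure(self, url):
        """Register a failed request (e.g. HTTP 504); return the new delay."""
        domain = self.domain_of(url)
        count = self._failures.get(domain, 0) + 1
        self._failures[domain] = count
        # Double the delay on each consecutive failure, capped at max_delay.
        delay = min(self.base_delay * 2 ** (count - 1), self.max_delay)
        self._not_before[domain] = self.clock() + delay
        return delay

    def record_success(self, url):
        """Reset the backoff state for the URL's domain."""
        domain = self.domain_of(url)
        self._failures.pop(domain, None)
        self._not_before.pop(domain, None)

    def seconds_until_allowed(self, url):
        """How long to wait before requesting this URL's domain again."""
        domain = self.domain_of(url)
        return max(0.0, self._not_before.get(domain, 0.0) - self.clock())
```

Because the state is keyed by domain rather than by spider, a failure on one tracker would only slow requests to that tracker, which fits the single-spider setup described above. A custom `_retry` could call `record_failure` and hold the re-queued request back for the returned number of seconds.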


Status: unread
Nosy List: paulproteus
Priority: wish
Imported from roundup ID: 793
Last modified: 2012-11-20.16:04:43