istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Elastic Moderated Throttled Queue #47

Closed madisonb closed 7 years ago

madisonb commented 8 years ago

The current RedisThrottledQueue, when used under Moderation, causes a slight drift in the actual processing of X hits in Y time. A delta d is added for each window, so that the successful number of X hits in Y time is really X hits in Y + d time. This delay is not normally an issue, but it crops up when very high velocity moderated keys occur (e.g. 60 hits in 60 seconds ends up taking ~61 or ~62 seconds). This may be due to network latency or an improper implementation.
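To illustrate how such a drift can accumulate, here is a minimal simulation. It is a sketch under stated assumptions, not the actual RedisThrottledQueue code: it assumes each moderation wait is measured from the moment the previous hit finished, and the function name and the 25 ms per-hit latency are hypothetical values chosen only for illustration.

```python
def simulate_naive_moderation(hits, window, latency):
    """Return the simulated time (seconds) needed to process `hits` hits.

    Assumption: every hit waits `window / hits` seconds measured from when
    the previous hit completed, so the per-hit `latency` is never recovered.
    """
    interval = window / hits
    clock = 0.0
    for _ in range(hits):
        clock += latency   # hypothetical network / processing delay
        clock += interval  # moderation wait, restarted after each hit
    return clock


total = simulate_naive_moderation(hits=60, window=60.0, latency=0.025)
drift = total - 60.0  # this is the delta d from the description above
print(f"total={total:.3f}s drift={drift:.3f}s")
```

With these assumed numbers the 60 hits take about 61.5 seconds instead of 60, which is in the same range as the drift reported above.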

We should have an ability to 'catch back up' or fix the moderation implementation so that the numbers line up with exactly what is defined. This may involve adding more items to the throttle_time key, or setting a minimum moderation value so that the catch up does not cause huge spikes in domain hits.
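One possible shape for the 'catch back up' idea is sketched below. This is only an illustration of the concept, not the fix that was eventually committed: hits are scheduled against a fixed ideal timeline (hit n belongs at n * interval), and a hypothetical `min_interval` parameter bounds how quickly a delayed queue may catch up so that it does not spike the domain. All names here are assumptions.

```python
def simulate_elastic_moderation(hits, window, latencies, min_interval):
    """Return the simulated time (seconds) of each hit.

    Hit n is allowed at its ideal slot n * (window / hits). If we fall
    behind, later hits may fire early relative to the previous hit, but
    never closer together than `min_interval`.
    """
    interval = window / hits
    clock = 0.0
    last_hit = None
    hit_times = []
    for n in range(hits):
        clock += latencies[n]  # hypothetical delay before this hit can fire
        ideal = n * interval   # slot on the fixed schedule
        if last_hit is None:
            earliest = ideal
        else:
            earliest = max(ideal, last_hit + min_interval)
        clock = max(clock, earliest)  # wait only if we are ahead of schedule
        hit_times.append(clock)
        last_hit = clock
    return hit_times


# 25 ms of latency per hit, plus one 3 second stall on the 11th hit.
latencies = [0.025] * 60
latencies[10] = 3.0
times = simulate_elastic_moderation(60, 60.0, latencies, min_interval=0.5)
gaps = [b - a for a, b in zip(times, times[1:])]
print(f"last hit at {times[-1]:.3f}s, smallest gap {min(gaps):.3f}s")
```

In this simulation the queue falls behind after the stall, recovers over the next few hits at no less than 0.5 seconds apart, and still finishes all 60 hits inside the 60 second window.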

softwarevamp commented 7 years ago

I found an interesting example at http://blog.gregburek.com/2011/12/05/Rate-limiting-with-decorators/ that may help you address the issue.

madisonb commented 7 years ago

The latest git commit that fixes this issue looks really good, and should allow us to hit exactly the limit and window we set. Here is an example of 60 hits in 60 secs.

[Screenshot taken 2017-01-08 at 5:13:55 pm]