Closed: @FaizShah closed this issue 4 years ago.
Hi @FaizShah, sorry for my slow reply.
The Request object provides a parameter named request_config, for example:
```python
from ruia import Spider


async def retry_func(request):
    # Raise the timeout before the request is retried
    request.request_config["TIMEOUT"] = 10


class RetryDemo(Spider):
    start_urls = ["http://httpbin.org/get"]
    request_config = {
        "RETRIES": 3,
        # add a delay for scraping
        "DELAY": 0,
        "TIMEOUT": 0.1,
        "RETRY_FUNC": retry_func,
    }

    async def parse(self, response):
        pages = ["http://httpbin.org/get?p=1", "http://httpbin.org/get?p=2"]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)

    async def parse_item(self, response):
        json_data = await response.json()
        print(json_data)


if __name__ == "__main__":
    RetryDemo.start()
```
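Outside Ruia, the idea behind `RETRY_FUNC` in the example above (adjust the request configuration before each retry, here by raising a deliberately tight timeout) can be sketched with plain asyncio. `fetch_with_retries` and `on_retry` are hypothetical names for illustration, not part of Ruia's API:

```python
import asyncio


async def fetch_with_retries(fetch, retries=3, timeout=0.1, on_retry=None):
    """Await fetch() up to `retries + 1` times; after each timeout,
    call on_retry(timeout) to pick the timeout for the next attempt."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(fetch(), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == retries:
                raise
            if on_retry is not None:
                timeout = on_retry(timeout)  # e.g. lambda t: 10
```

Here `RETRY_FUNC` plays the role of `on_retry`: the first attempt runs with `TIMEOUT = 0.1`, and every retry runs with the new value the hook sets.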
Hi @howie6879,
Not a problem, it was actually a really quick reply!
This solution is working perfectly for me.
If I just modify https://github.com/howie6879/ruia/blob/c7eb0f5e49457ca11bc0f3fd6196934b6befeb5b/ruia/request.py#L100-L101 to evaluate a function instead of a constant delay, would that be the only change needed to create a randomized delay?
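For reference, making those lines accept either a number or a zero-argument callable would be a small change. A minimal sketch of the idea (`resolve_delay` is a hypothetical helper, not Ruia code):

```python
import random


def resolve_delay(request_config):
    """Return the delay to sleep: accept a constant or a 0-arg callable."""
    delay = request_config.get("DELAY", 0)
    return delay() if callable(delay) else delay


# A constant keeps the current behaviour...
assert resolve_delay({"DELAY": 5}) == 5
# ...while a callable yields a fresh randomized delay per request.
assert resolve_delay({"DELAY": lambda: abs(random.gauss(1, 0.5)) * 2}) >= 0
```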
Hi @FaizShah: there is a simpler way to solve your problem. Ruia can customize a request's parameters through middleware, for example:
```python
#!/usr/bin/env python
import random

from ruia import Middleware, Spider

middleware = Middleware()


@middleware.request
async def print_on_request(spider_ins, request):
    # Overwrite the per-request config with a fresh randomized delay
    request.request_config = {"RETRIES": 3, "DELAY": abs(random.gauss(1, 0.5)) * 2}
    print(f"request: {request.request_config}")


class MiddlewareSpiderDemo(Spider):
    start_urls = ["https://httpbin.org/get"]
    concurrency = 10

    async def parse(self, response):
        pages = [f"https://httpbin.org/get?p={i}" for i in range(1, 3)]
        for page in pages:
            yield self.request(url=page)


if __name__ == "__main__":
    MiddlewareSpiderDemo.start(middleware=middleware)
```
We can see that the terminal has the following output:
```
request: {'RETRIES': 3, 'DELAY': 2.4240349739809686}
[2019:11:12 20:52:07] INFO  Spider  Spider started!
[2019:11:12 20:52:07] INFO  Spider  Worker started: 4535383992
[2019:11:12 20:52:07] INFO  Spider  Worker started: 4535384128
[2019:11:12 20:52:10] INFO  Request <GET: https://httpbin.org/get>
https://httpbin.org/get?p=1
https://httpbin.org/get?p=2
request: {'RETRIES': 3, 'DELAY': 1.6505360498833879}
request: {'RETRIES': 3, 'DELAY': 1.8258791988736391}
[2019:11:12 20:52:12] INFO  Request <GET: https://httpbin.org/get?p=1>
[2019:11:12 20:52:13] INFO  Request <GET: https://httpbin.org/get?p=2>
[2019:11:12 20:52:14] INFO  Spider  Stopping spider: Ruia
[2019:11:12 20:52:14] INFO  Spider  Total requests: 3
[2019:11:12 20:52:14] INFO  Spider  Time usage: 0:00:06.218620
[2019:11:12 20:52:14] INFO  Spider  Spider finished!
```
More info about middleware
Perfect, thank you @howie6879
Hi,
Thanks for this wonderful project.
What is the recommended way to add rate limiting to the spider?
When scraping, I normally add a randomized delay to get a random delay with a mean of about 2 seconds.
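For example, something like this (a sketch; the exact Gaussian distribution here is illustrative, not prescribed):

```python
import random
import time


def random_delay(mean=2.0):
    # |N(mean/2, mean/4)| * 2 has a mean of roughly `mean` seconds
    return abs(random.gauss(mean / 2, mean / 4)) * 2


def polite_sleep():
    # Sleep between consecutive requests
    time.sleep(random_delay())
```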
Another common approach is rate limiting, e.g. 10 requests per minute or 1 request every 5 seconds.
What would be the recommended way to rate limit in ruia?
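For the fixed-rate variant ("1 request every 5 seconds"), one common asyncio pattern is a small lock-guarded limiter awaited before each request, e.g. from inside a request middleware hook. `RateLimiter` below is a sketch of that pattern, not a Ruia feature:

```python
import asyncio
import time


class RateLimiter:
    """Allow at most one acquire() per `interval` seconds across all tasks."""

    def __init__(self, interval: float):
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_at = 0.0  # earliest monotonic time for the next request

    async def acquire(self):
        async with self._lock:
            now = time.monotonic()
            wait = self._next_at - now
            if wait > 0:
                await asyncio.sleep(wait)
            self._next_at = max(now, self._next_at) + self.interval
```

Awaiting `limiter.acquire()` at the top of a request middleware spaces requests out even when `concurrency` is greater than 1; 10 requests per minute would be `RateLimiter(6.0)`, and 1 request per 5 seconds `RateLimiter(5.0)`.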