howie6879 / ruia

Async Python 3.6+ web scraping micro-framework based on asyncio
https://www.howie6879.com/ruia/
Apache License 2.0

Rate Limiting? #78

Closed FaizShah closed 4 years ago

FaizShah commented 5 years ago

Hi,

Thanks for this wonderful project.

What is the recommended way to add rate limiting to the spider?

I normally add a randomized delay when scraping, like this:

import random, time
import requests

requests.get(url)
time.sleep(abs(random.gauss(1, 0.5)) * 2)

This gives a random delay with a mean of about 2 seconds.

Another common approach is limiting to a fixed rate, such as 10 requests per minute or 1 request per 5 seconds.
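
For reference, the fixed-interval variant can be written with plain asyncio, independent of any framework. A minimal sketch (this RateLimiter helper is only for illustration, not part of ruia):

import asyncio
import time

class RateLimiter:
    # Space out acquire() calls: at most one per `interval` seconds.
    def __init__(self, interval: float):
        self.interval = interval
        self._next_slot = 0.0

    async def acquire(self):
        # No awaits between reading and updating _next_slot, so this
        # bookkeeping is atomic within a single event loop.
        now = time.monotonic()
        wait = max(0.0, self._next_slot - now)
        self._next_slot = now + wait + self.interval
        if wait > 0:
            await asyncio.sleep(wait)

Each task would await limiter.acquire() before sending a request (interval=5 for 1 request per 5 seconds, interval=6 for 10 requests per minute).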

What would be the recommended way to rate limit in ruia?

howie6879 commented 5 years ago

Hi @FaizShah, sorry for my slow reply. The Request object provides a parameter named request_config, for example:

from ruia import Spider

async def retry_func(request):
    # Called when a request fails: raise the timeout before the retry.
    request.request_config["TIMEOUT"] = 10

class RetryDemo(Spider):
    start_urls = ["http://httpbin.org/get"]

    request_config = {
        "RETRIES": 3,
        # DELAY is the pause (in seconds) before each request; raise it to throttle.
        "DELAY": 0,
        # Deliberately tiny timeout so this demo triggers retry_func.
        "TIMEOUT": 0.1,
        "RETRY_FUNC": retry_func,
    }

    async def parse(self, response):
        pages = ["http://httpbin.org/get?p=1", "http://httpbin.org/get?p=2"]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)

    async def parse_item(self, response):
        json_data = await response.json()
        print(json_data)

if __name__ == "__main__":
    RetryDemo.start()
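
Here "DELAY" is the pause, in seconds, that ruia applies before each request, so a positive value throttles the spider (e.g. "DELAY": 5 gives roughly one request every five seconds per worker). The very small "TIMEOUT" above is only there so this demo actually triggers retry_func.
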
FaizShah commented 4 years ago

Hi @howie6879,

Not a problem, it was actually a really quick reply!

This solution is working perfectly for me.

If I just modify https://github.com/howie6879/ruia/blob/c7eb0f5e49457ca11bc0f3fd6196934b6befeb5b/ruia/request.py#L100-L101 to evaluate a function instead of a constant delay, would that be the only thing I need to change to get a randomized delay?
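
For illustration, the kind of change meant here, written as a hypothetical patch rather than ruia's shipped code (assuming it runs inside the Request's async fetch path):

# Hypothetical variant of the delay handling in ruia/request.py:
# let "DELAY" be either a number or a zero-argument callable.
delay = self.request_config.get("DELAY", 0)
if callable(delay):
    delay = delay()  # e.g. lambda: abs(random.gauss(1, 0.5)) * 2
if delay > 0:
    await asyncio.sleep(delay)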

howie6879 commented 4 years ago

Hi @FaizShah: there is a simpler way to solve your problem. Ruia can customize the parameters of each request through middleware, for example:

#!/usr/bin/env python
import random
from ruia import Spider, Middleware

middleware = Middleware()

@middleware.request
async def print_on_request(spider_ins, request):
    # Runs before every request: assign a fresh random delay each time.
    request.request_config = {"RETRIES": 3, "DELAY": abs(random.gauss(1, 0.5)) * 2}
    print(f"request: {request.request_config}")

class MiddlewareSpiderDemo(Spider):
    start_urls = ["https://httpbin.org/get"]
    concurrency = 10

    async def parse(self, response):
        pages = [f"https://httpbin.org/get?p={i}" for i in range(1, 3)]
        for page in pages:
            yield self.request(url=page)

if __name__ == "__main__":
    MiddlewareSpiderDemo.start(middleware=middleware)

The terminal then shows output like the following:

request: {'RETRIES': 3, 'DELAY': 2.4240349739809686}
[2019:11:12 20:52:07] INFO  Spider  Spider started!
[2019:11:12 20:52:07] INFO  Spider  Worker started: 4535383992
[2019:11:12 20:52:07] INFO  Spider  Worker started: 4535384128
[2019:11:12 20:52:10] INFO  Request <GET: https://httpbin.org/get>
https://httpbin.org/get?p=1
https://httpbin.org/get?p=2
request: {'RETRIES': 3, 'DELAY': 1.6505360498833879}
request: {'RETRIES': 3, 'DELAY': 1.8258791988736391}
[2019:11:12 20:52:12] INFO  Request <GET: https://httpbin.org/get?p=1>
[2019:11:12 20:52:13] INFO  Request <GET: https://httpbin.org/get?p=2>
[2019:11:12 20:52:14] INFO  Spider  Stopping spider: Ruia
[2019:11:12 20:52:14] INFO  Spider  Total requests: 3
[2019:11:12 20:52:14] INFO  Spider  Time usage: 0:00:06.218620
[2019:11:12 20:52:14] INFO  Spider  Spider finished!
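
The same hook can also cover your fixed-rate case (e.g. one request per 5 seconds). A rough, untested sketch; the slot bookkeeping below is illustrative, not a ruia feature:

#!/usr/bin/env python
import time
from ruia import Middleware

middleware = Middleware()
_next_slot = 0.0  # monotonic time at which the next request may start

@middleware.request
async def one_per_five_seconds(spider_ins, request):
    global _next_slot
    now = time.monotonic()
    wait = max(0.0, _next_slot - now)
    _next_slot = now + wait + 5.0
    # Reuse ruia's DELAY mechanism to hold each request until its slot.
    request.request_config = {"RETRIES": 3, "DELAY": wait}
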
howie6879 commented 4 years ago

More info about middleware

FaizShah commented 4 years ago

Perfect, thank you @howie6879