scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Per request delay #802

Open chekunkov opened 9 years ago

chekunkov commented 9 years ago

Sometimes I feel like scrapy is missing per request delays. Any reasons why they weren't implemented?

Where can per request delays be used:

I think there could be other situations where they can be useful.

Seems like this question was also raised in #254, but I don't like the way it was implemented there; I could spend some time on another implementation and provide a PR.

nramirezuy commented 9 years ago

@chekunkov I think the main reason is that it's hard to implement and maintain.

dangra commented 9 years ago

Sometimes I feel like scrapy is missing per request delays. Any reasons why they weren't implemented?

So far nobody has taken on the work to design, argue for and implement a solution.

I think the scheduler is the component that must handle per-request delays, but it doesn't look like a simple task.

curita commented 9 years ago

@chekunkov What do you think of adding this feature as an idea for this year's GSoC? I don't know if you've already started implementing it, but it seems like a really fun and challenging project for a student, and provided with good mentoring (I'd love to have you as a mentor if you're interested) it seems feasible within GSoC's timeframe.

chekunkov commented 9 years ago

@Curita Yes, it seems like a good idea. I haven't started working on it, so it's okay to give this to someone as a GSoC project. Concerning your proposal to make me a mentor for this project - I need some time to think about it, not sure if I'll have enough time and energy to help students :)

rocioar commented 9 years ago

hey @chekunkov @Curita I have this implemented for the epubdirect project; I will generalize it and make a PR, if that sounds good.

chekunkov commented 9 years ago

@rocioar yay, sounds great I think

kmike commented 9 years ago

Global DOWNLOAD_DELAY means "time between two requests for the same downloader slot (==website)". It is also possible to set the download delay per Downloader slot (website), though this API is neither public nor documented.
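For illustration, a rough sketch of what poking at that internal API looks like (these are private attributes, not a stable interface, and the helper function is made up for the example):

def set_slot_delay(crawler, slot_key, delay):
    # downloader.slots maps a slot key (by default the domain) to a Slot
    # object with .delay, .concurrency and .randomize_delay attributes
    slot = crawler.engine.downloader.slots.get(slot_key)
    if slot is not None:
        slot.delay = delay  # seconds between requests for this slot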

Delay is a time between two events, right? What does "per-request" delay mean? Which two events are separated by this delay?

Sorry, I must be missing something obvious :) but "per-request delay" currently doesn't make any sense to me.

umrashrf commented 9 years ago

I think they meant the delay between the scheduler and the downloader. So you schedule a request with a delay, and then it is downloaded after that delay.

nramirezuy commented 9 years ago

@chekunkov Something like this makes sense to you?

# DownloaderMiddleware
import time

class PerRequestDelayMware(object):
    ...
    # handler for the request_scheduled signal
    def request_scheduled(self, request, spider):
        request.meta.setdefault('_scheduled_time', time.time())

    def process_request(self, request, spider):
        scheduled_time = request.meta.get('_scheduled_time')
        per_request_delay = request.meta.get('per_request_delay')
        if not scheduled_time or not per_request_delay:
            return
        # if the delay hasn't elapsed yet, send the request back to the
        # scheduler with a lowered priority (some_negative_value is a placeholder)
        if scheduled_time + per_request_delay > time.time():
            request.priority += some_negative_value
            return request
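For the sketch above to actually run, the signal handler has to be connected and the middleware enabled; a possible wiring, shown separately here (the module path in the settings is hypothetical):

from scrapy import signals

class PerRequestDelayMware(object):
    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # connect the request_scheduled handler from the sketch above
        crawler.signals.connect(mw.request_scheduled, signal=signals.request_scheduled)
        return mw

# settings.py
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.PerRequestDelayMware': 543}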
rocioar commented 9 years ago

Well, my idea of it was:

Creating another extension, call it CustomDelayThrottle... or whatever, that will inherit almost all functionality from AutoThrottle; the only difference is that it will set up the delays specified in settings for the different domains. Discussing with @kmike, he mentioned it would be better to improve the slots API, make it public, write some docs and add this CustomDelayThrottle as an example there, which I thought was reasonable.

nramirezuy commented 9 years ago

@rocioar AutoThrottle handles slot delays, not Request delays.

rocioar commented 9 years ago

I know; slots are divided per domain, which means that using the functionality AutoThrottle already has we could set up custom delays per domain.

Something like:

DOMAIN_DELAYS = {
  'amazon.com': 1.0,
  'amazon.co.uk': 0.5
}

for domain in DOMAIN_DELAYS:
  if domain in response.url:
    slot.delay = DOMAIN_DELAYS[domain]
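One hedged way to flesh that pseudocode out as an extension (the class name is made up, and slot access goes through the internal, undocumented downloader API):

from scrapy import signals

class DomainDelayExtension:
    def __init__(self, crawler):
        self.crawler = crawler
        self.domain_delays = crawler.settings.getdict('DOMAIN_DELAYS')
        crawler.signals.connect(self._response_downloaded, signal=signals.response_downloaded)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def _response_downloaded(self, response, request, spider):
        slot_key = request.meta.get('download_slot')
        slot = self.crawler.engine.downloader.slots.get(slot_key)
        if slot is None:
            return
        # apply the custom delay for the matching domain to its slot
        for domain, delay in self.domain_delays.items():
            if domain in response.url:
                slot.delay = delay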
nramirezuy commented 9 years ago

@rocioar AutoThrottle's logic doesn't cover adding exponential backoff for a retried request. It will slow down the whole slot, not just that single request. Your proposal is valid and I like it, but it doesn't solve the current issue.

chekunkov commented 9 years ago

I drew 2 diagrams - I hope they will help explain what I mean.

As far as I remember, when I was creating this issue I had the following problem: for some specific URL (some API call?) the site returned an error (or an empty body?) from time to time (unpredictably), and the only way to work around this was to retry with some delay. The problem is that it was 9 months ago, I don't remember the details, and I also don't remember why we didn't just change the slot delay. Possible reasons:

1) the slot API isn't public; it's not clear how to change the slot delay in a callback or how to create and use a different slot for retries - so I had no idea I could use that

2) for some reason I didn't want to affect other requests running in the background and wanted to apply the delay only to the given retry request (see fig. 1)

fig. 1 (image attachment)

And what I probably wanted in this case is to be able to apply custom delays only to requests where they are set (see fig. 2)

fig. 2 (image attachment)

i.e. use slot.download_delay() by default and Request.download_delay when it's set.

Delay is a time between two events, right? What does "per-request" delay mean? Which two events are separated by this delay?

@kmike two events - two requests fired from a single slot. "Per-request" - because here I meant a delay that's different from the default and is set on the Request object (see fig. 2). Ideally as a Request attribute, by analogy with dont_filter.

Something like this makes sense to you?

@nramirezuy theoretically this solution fits the problem I described. Nothing like this came to mind 9 months ago, so maybe it's not a very obvious one. Wouldn't it be nice to have something like this handled out of the box by the scheduler?

using the functionality that Autothrottle already has we could set up custom delays per domain.

@rocioar you mean not only setting it during crawl initialisation but also being able to change it from the spider or via some request.meta key - right?

kmike commented 8 years ago

I faced a problem with retries today, so it seems I'm finally starting to understand what problems you had :snail:

Tweaking priorities (as in the current RetryMiddleware or in @nramirezuy's example) is not enough: say we have 100 requests to the same server, and the server is temporarily down (e.g. for 5 seconds). We push back the first 16, then the next 16, then the next (all this is really quick because the server is dropping connections); in the end the first 16 requests are processed again, without any waiting - they will soon turn out to be the requests with the highest priority again. It means Scrapy will retry all these 100 requests RETRY_TIMES times and fail them all (likely during these 5s of downtime) - instead of waiting a bit.

To make it work we need a way to say: "please process this request, but no sooner than X seconds from now". It is IMHO different from rate limiting (i.e. from delays between requests). You may need both: wait X seconds and then process the request, respecting rate and concurrency limits. So I think this "per-request delay" is not an overridden value of a slot delay, it is a totally different thing.

As usual, @dangra is right and it looks like the best place to handle it is a scheduler.

Without scheduler support one can use callLater to schedule a request to a later time. It will have the same effect ("call no sooner than X seconds from now, respect rate limits"), but the downside is that the request is kept in memory for these X seconds, not in a scheduler queue. It means persistence is not supported, with all the consequences: requests are dropped in case of a restart, increased memory usage.
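For example, a minimal sketch of that callLater approach inside a spider (assuming access to self.crawler; engine.crawl takes a spider argument here, as it did at the time):

from twisted.internet import reactor

def _retry_later(self, request, delay):
    # the request stays in memory, not in a scheduler queue, until the delay elapses
    request = request.replace(dont_filter=True)  # so the dupefilter doesn't drop the retry
    reactor.callLater(delay, self.crawler.engine.crawl, request, self)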

To implement it in a scheduler we'll have to make 2 big changes:

  1. All queues must implement delays support.
  2. There should be a way for scheduler/queue to say "I have some requests, so don't stop the spider, but I won't give you anything now".
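A hypothetical sketch of what a delay-aware scheduler could look like (this is not Scrapy's actual scheduler API, and the request_delay meta key is invented for the example):

import heapq
import time
from itertools import count

class DelayAwareScheduler:
    def __init__(self):
        self._ready = []    # requests that may go out now
        self._delayed = []  # heap of (due_time, tiebreaker, request)
        self._counter = count()

    def enqueue_request(self, request):
        delay = request.meta.get('request_delay', 0)
        if delay:
            heapq.heappush(self._delayed, (time.time() + delay, next(self._counter), request))
        else:
            self._ready.append(request)
        return True

    def next_request(self):
        now = time.time()
        # move requests whose delay has elapsed into the ready queue
        while self._delayed and self._delayed[0][0] <= now:
            self._ready.append(heapq.heappop(self._delayed)[2])
        # returning None while __len__() > 0 is point 2 above:
        # "I have some requests, but I won't give you anything now"
        return self._ready.pop(0) if self._ready else None

    def __len__(self):
        return len(self._ready) + len(self._delayed)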
nramirezuy commented 8 years ago

@kmike that is what my approach does: send requests back to the scheduler if they didn't wait those <per_request_delay> (5 sec).

        if scheduled_time + per_request_delay > time.time():
            request.priority += some_negative_value
            return request

am I missing something?

EDIT: I guess you want to edit _scheduled_time after an exception.

kmike commented 8 years ago

@nramirezuy hm, isn't it inefficient? Let's say you have 1 request: you'll be sending it to the scheduler and getting it back in a busy loop until its time comes. Or is that fixed by https://github.com/scrapy/scrapy/pull/1253?

nramirezuy commented 8 years ago

@kmike first of all we change the priority, so requests will rotate. I don't know what you mean by a busy loop; but the heaviest task is reading the request from the queue, and that is something the scheduler can't avoid.

kmike commented 8 years ago

If all requests are to the same server and this server is not available then priorities do nothing useful.

By busy loop I mean a case when Scrapy reads a request from a scheduler, detects that it should be delayed, sends it back to the scheduler, then reads the same (or similar) request again (because there are no requests in scheduler which shouldn't be delayed), sends it back again without processing, etc. If implemented naively, it will consume 100% CPU without doing anything helpful.

nramirezuy commented 8 years ago

Well, changing priorities allows request rotation; this is useful when different requests might have different delays.

No it's not using 100% CPU don't worry :smile:

chekunkov commented 8 years ago

I can only repeat myself

@nramirezuy theoretically this solution fits the problem I described. Nothing like this came to mind 9 months ago, so maybe it's not a very obvious one. Wouldn't it be nice to have something like this handled out of the box by the scheduler?

@kmike

To make it work we need a way to say: "please process this request, but no sooner than X seconds from now". It is IMHO different from rate limiting (i.e. from delays between requests).

Yeah, agreed, your explanation of the problem is much better; sorry for the poor problem statement and examples. I think yes, this could be something completely different from the slot delay - we should give it a good name, maybe 'hold_delay', or just 'hold'. If it is implemented in the scheduler and described in the documentation, I have no doubt it will be widely used.

There should be a way for scheduler/queue to say "I have some requests, so don't stop the spider, but I won't give you anything now".

This could be challenging, especially taking into account how overcomplicated the downloader implementation is.

dangra commented 8 years ago

There should be a way for scheduler/queue to say "I have some requests, so don't stop the spider, but I won't give you anything now".

This could be challenging, especially taking into account how overcomplicated the downloader implementation is.

I think this is the easy part; it is handled by the engine and there is an existing check for pending requests in the scheduler queues: https://github.com/scrapy/scrapy/blob/master/scrapy/core/engine.py#L165

The hard part is making the scheduler aware of delayed requests.

eLRuLL commented 6 years ago

@kmike any reason this isn't a GSoC candidate anymore? 😄

kmike commented 6 years ago

@eLRuLL I wasn't aware this issue was a GSoC candidate :) Haha, it was, and I removed the label. OK, so I think the issue is rather small for a GSoC project. But it can be part of a GSoC project, e.g. a project on improving the built-in scheduler.

misssprite commented 5 years ago

Any progress on this issue? I came across similar problems with an unstable server. It would be great to have a configurable delay for a request before it is scheduled the next time.

apalala commented 4 years ago

@starrify helped design this, which solved the case of polling computed solutions from some services:

# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, unicode_literals

from twisted.internet import reactor
import twisted.internet.task

DELAY_META = '__defer_delay'

def defer_request(seconds, request):
    meta = dict(request.meta)
    meta.update({DELAY_META: seconds})
    return request.replace(meta=meta)

class DeferMiddleware(object):
    def process_request(self, request, spider):
        delay = request.meta.pop(DELAY_META, None)
        if not delay:
            return

        return twisted.internet.task.deferLater(reactor, delay, lambda: None)
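A possible way to use it (the middleware path in the settings is hypothetical), e.g. to poll the same URL again after 30 seconds:

# settings.py
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.DeferMiddleware': 543}

# in a spider callback:
yield defer_request(30, response.request.replace(dont_filter=True))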
kmike commented 4 years ago

@apalala thanks for pasting an example! This is an implementation "without scheduler support" we were talking about; there are some downsides (https://github.com/scrapy/scrapy/issues/802#issuecomment-140446223)

Without scheduler support one can use callLater to schedule a request to a later time. It will have the same effect ("call no sooner than X seconds from now, respect rate limits"), but the downside is that the request is kept in memory for these X seconds, not in a scheduler queue. It means persistence is not supported, with all the consequences: requests are dropped in case of a restart, increased memory usage.

GeorgeA92 commented 4 years ago

I have some thoughts related to this issue.

In the general case the download slot delay is based on the DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY settings values:

https://github.com/scrapy/scrapy/blob/b8594353d03be5574f51766c35566b713584302b/scrapy/core/downloader/__init__.py#L36-L39

In this implementation of per-request delays (based on https://github.com/scrapy/scrapy/issues/802#issuecomment-78143005 and the actual code (v1.6) of the downloader class) I propose to take the delay from the per_request_delay meta key of the first request in the download slot queue:

import random

from scrapy.core.downloader import Slot


class PerRequestDelaySlot(Slot):
    def download_delay(self):
        # use the per-request delay of the first queued request, if any
        if self.queue and "per_request_delay" in self.queue[0][0].meta:
            return self.queue[0][0].meta["per_request_delay"]
        # from the original Slot.download_delay():
        if self.randomize_delay:
            return random.uniform(0.5 * self.delay, 1.5 * self.delay)
        return self.delay
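A request would then opt in through its meta, for example (the 10-second value is arbitrary):

yield scrapy.Request(url, callback=self.parse, meta={"per_request_delay": 10})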

A full example of usage is provided in this gist code sample. @kmike

2. There should be a way for scheduler/queue to say "I have some requests, so don't stop the spider, but I won't give you anything now".

From a detailed look at the Downloader._process_queue method I see another 2 possibilities to do it:

slot.latercall

This code line returns a twisted.internet.base.DelayedCall object: https://github.com/scrapy/scrapy/blob/b8594353d03be5574f51766c35566b713584302b/scrapy/core/downloader/__init__.py#L150

According to the DelayedCall documentation we can additionally apply its delay method:

class DelaySlotLatercallMiddleware(object):

    delay = 60 * 30  # seconds

    def process_response(self, request, response, spider):
        #...
        #...
        # TempBanCondition stands for whatever marks the response as a temporary ban
        if TempBanCondition and "download_slot" in request.meta:
            d_slot = request.meta["download_slot"]
            latercall = spider.crawler.engine.downloader.slots[d_slot].latercall
            if latercall:
                latercall.delay(self.delay)
        return response

This approach requires a non-None value of slot.latercall (at least one scheduled request in the specified download slot).

slot.lastseen

We can add an additional delay by affecting the penalty variable in this code: https://github.com/scrapy/scrapy/blob/b8594353d03be5574f51766c35566b713584302b/scrapy/core/downloader/__init__.py#L144-L151 For example, by changing the download slot's lastseen value:

from time import time


class DelaySlotLastSeenMiddleware(object):

    delay = 60 * 30  # seconds

    def process_response(self, request, response, spider):
        # TempBanCondition stands for whatever marks the response as a temporary ban
        if TempBanCondition and "download_slot" in request.meta:
            d_slot = request.meta["download_slot"]
            #if not spider.crawler.engine.downloader.slots[d_slot].queue:
            spider.crawler.engine.downloader.slots[d_slot].lastseen = time() + self.delay
        return response

In contrast to the slot.latercall approach, this one currently works only if no requests are being executed, because these code lines can shift the slot's lastseen value back: https://github.com/scrapy/scrapy/blob/b8594353d03be5574f51766c35566b713584302b/scrapy/core/downloader/__init__.py#L154-L155 This can be solved with the following change:

while slot.queue and slot.free_transfer_slots() > 0:
    slot.lastseen = max(now, slot.lastseen)

Examples of usage of the slot.latercall and slot.lastseen approaches are also presented in this gist code sample (slot.latercall - commented code lines in the parse method, slot.lastseen - commented code lines in the start_requests method).

atlowell-smpl commented 4 years ago

Bumping this. Having different download delays would be incredibly useful. For instance, you could have one delay that is performed between entry points (start_urls), one delay that is performed between individual pages, and one delay that is used to handle data obtained from ajax requests (such as data that is loaded by button presses on a single page).

netcaf commented 3 years ago

Hello,

How about a method through a downloader middleware?

  1. Create a custom middleware:
from scrapy import signals


class CustomDownloadDelayMiddleware:
    _CUSTOM_DOWNLOAD_SLOT = '__custom_download_slot_{}__'

    def __init__(self, crawler):
        self.crawler = crawler
        crawler.signals.connect(self._request_scheduled, signal=signals.request_scheduled)
        crawler.signals.connect(self._request_reached_downloader, signal=signals.request_reached_downloader)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def _request_reached_downloader(self, request, spider):
        delay = request.meta.get('custom_download_delay')
        if delay:
            slot_name = self._CUSTOM_DOWNLOAD_SLOT.format(delay)
            self.crawler.engine.downloader.slots[slot_name].delay = delay

    def _request_scheduled(self, request, spider):
        delay = request.meta.get('custom_download_delay')
        if delay:
            slot_name = self._CUSTOM_DOWNLOAD_SLOT.format(delay)
            request.meta.setdefault('download_slot', slot_name)
  2. Set the delay in the Spider:
yield scrapy.Request(url=url, callback=self.parse, meta={"custom_download_delay": 1})
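The middleware also needs to be enabled, e.g. (the module path is hypothetical):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloadDelayMiddleware': 543,
}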
Gallaecio commented 3 years ago

I don’t think that per-delay-length slots solve the per-request delay issue.

caffeinatedMike commented 3 years ago

Also bumping this because it would be an incredible feature for broad crawls where there are certain domains you know get overloaded more easily than others.

Gallaecio commented 3 years ago

@caffeinatedMike For that, per-slot delays, not per-request delays, would be what you want, I believe. That is relatively easy to implement with a custom Scrapy extension.

For example:

from scrapy import signals
from scrapy.core.downloader import Slot

class ThrottlingPerSlotController:

    def __init__(self, crawler, *args, **kwargs):
        self.crawler = crawler
        crawler.signals.connect(
            self._response_downloaded, signal=signals.response_downloaded
        )
        crawler.signals.connect(
            self._spider_opened, signal=signals.spider_opened
        )
        self._pending_delays = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def _spider_opened(self, spider):
        slots = self.crawler.settings.getdict("THROTTLING_PER_SLOT") or {}
        for key, settings in slots.items():
            self.crawler.engine.downloader.slots[key] = Slot(
                concurrency=settings.get(
                    'concurrency',
                    self.crawler.settings.getint(
                        'CONCURRENT_REQUESTS_PER_DOMAIN'
                    )
                ),
                delay=0,  # The first request should not be delayed.
                randomize_delay=settings.get(
                    'randomize_delay',
                    self.crawler.settings.getbool(
                        'RANDOMIZE_DOWNLOAD_DELAY'
                    )
                ),
            )
        self._pending_delays = {
            key: settings['download_delay']
            for key, settings in slots.items()
            if 'download_delay' in settings
        }

    def _response_downloaded(self, response, request, spider):
        key = request.meta.get("download_slot")
        if key not in self._pending_delays:
            return
        slot = self.crawler.engine.downloader.slots[key]
        slot.delay = self._pending_delays.pop(key)

When enabled, this extension allows using a THROTTLING_PER_SLOT setting to define a dictionary where keys are slots (domain names) and values are dictionaries with custom concurrency, download_delay and randomize_delay (optional) values.
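For instance, the setting could look like this (the domain and values are only an example):

THROTTLING_PER_SLOT = {
    "example.com": {
        "concurrency": 2,
        "download_delay": 5.0,
        "randomize_delay": True,
    },
}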

caffeinatedMike commented 3 years ago

@Gallaecio Thanks so much for this! But, I actually want to be able to control the throttling per-request (once the site starts throwing 502s), ideally by setting a meta value. Up to that point I don't want it throttled.

caffeinatedMike commented 3 years ago

@Gallaecio I tried implementing this as a hold-over until I find some sort of per-request solution, but it doesn't seem to be doing anything. Any ideas?

The only addition I made to your code was a single line for logging to see if it was actually working

    def _response_downloaded(self, response, request, spider):
        key = request.meta.get("download_slot")
        if key not in self._pending_delays:
            return
        slot = self.crawler.engine.downloader.slots[key]
        print(f'[delayed_requests] delaying request for "{key}"')
        slot.delay = self._pending_delays.pop(key)

I've added the extension to my extensions.py file in my project and have the below settings on my spider

class LabelSpider(CrawlSpider):
    name = 'label'
    custom_settings = {
        # set retry times arbitrarily high in combination with slot-delay to counteract overloading servers
        # basically just keep retrying until it's able to respond again
        "RETRY_TIMES": 25,
        "LOG_FILE": "label_runtime.log",
        "EXTENSIONS": {
            "scrapy.extensions.spiderstate.SpiderState": None,
            "myproj.extensions.SpiderStateManager": 0,
            "myproj.extensions.ThrottlingPerSlotController": 0
        },
        "THROTTLING_PER_SLOT": {
            "DOMAINOMITTED.com": {
                "download_delay": 2.5,
                "randomize_delay": True
            }
        },
        "ITEM_PIPELINES": {
            "myproj.pipelines.NutrientMappingPipeline": 300,
            "myproj.pipelines.SQLitePipeline": 500
        },
        # Performance tweaks for broad crawls, courtesy of https://docs.scrapy.org/en/latest/topics/broad-crawls.html
        "COOKIES_ENABLED": False,
        "SCHEDULER_PRIORITY_QUEUE": 'scrapy.pqueues.DownloaderAwarePriorityQueue',
        "CONCURRENT_REQUESTS": 100,
        "REACTOR_THREADPOOL_MAXSIZE": 20
    }
Gallaecio commented 3 years ago

I think we should not debug that extension in comments here, as the code is not directly related to the original issue.

If the extension does not work for you, please debug it yourself, and raise any problem that you do not understand on StackOverflow with a minimal, reproducible example.

If you find a bug in the extension, and you are willing to share a fix, feel free to share the updated extension code here.

realslimshanky-sh commented 3 years ago

Hi, I was facing a similar issue where I needed to delay an individual request and I came up with this solution, let me know what you think about this. Thanks.

WinterComes commented 3 years ago

Hello, I've created a PR with scheduler-side support for per-request delays. Briefly, it:

  1. allows processing a request no sooner than X seconds from now;
  2. respects rate and concurrency limits;
  3. avoids the need to update all queues to support delays;
  4. has a way for the scheduler/queue to say "I have some requests, so don't stop the spider, but I won't give you anything now";
  5. doesn't block the downloader, so other "not-delayed" requests can be processed in the meantime.

A big wall of text with details inside the description :)