scrapy-plugins / scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
BSD 3-Clause "New" or "Revised" License

Define banned outcome based on code + message #39

tomasrinke closed this issue 5 years ago

tomasrinke commented 7 years ago

As seen here: https://doc.scrapinghub.com/crawlera.html#errors

503 could mean multiple errors, not just a ban:

X-Crawlera-Error    Response Code    Error Message
...
noslaves            503              No available proxies
slavebanned         503              Website crawl ban
serverbusy          503              Server busy: too many outstanding requests
...

scrapy-crawlera only checks the status code, which can be misleading:

        if response.status == self.ban_code:
            self._bans[key] += 1
            if self._bans[key] > self.maxbans:
                self.crawler.engine.close_spider(spider, 'banned')
            else:
                after = response.headers.get('retry-after')
                if after:
                    self._set_custom_delay(request, float(after))
            self.crawler.stats.inc_value('crawlera/response/banned')
        else:

IMHO it should also consider the message of the response: count a ban only when the HTTP code is 503 and the message is "Proxy has been banned".

I noticed this in the Scrapy stats output:

{'crawlera/request': 410730,
 'crawlera/request/method/GET': 410730,
 'crawlera/response': 410412,
 'crawlera/response/banned': 433,
 'crawlera/response/error': 48,
 'crawlera/response/error/banned': 15,
 'crawlera/response/error/internal_error': 15,
 'crawlera/response/error/timeout': 18,
 'crawlera/response/status/200': 409414,
 'crawlera/response/status/400': 514,
 'crawlera/response/status/403': 3,
 'crawlera/response/status/500': 15,
 'crawlera/response/status/502': 15,
 'crawlera/response/status/503': 433,
 'crawlera/response/status/504': 18,

The Crawlera stats show only 15 errors with 503 and "Proxy has been banned", which matches the 'crawlera/response/error/banned' count.
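To make the gap concrete, here is a small sketch using only the three counters quoted above. It assumes that every 503 not reported under 'crawlera/response/error/banned' was something other than a ban (for example "No available proxies" or "Server busy"); the stats themselves do not confirm that breakdown.

```python
# Counters copied from the stats dump above; nothing else is assumed known.
stats = {
    "crawlera/response/banned": 433,        # incremented on every 503
    "crawlera/response/status/503": 433,    # all 503 responses
    "crawlera/response/error/banned": 15,   # 503s that Crawlera reported as bans
}

# Every 503 was counted as a ban by the status-only check.
assert stats["crawlera/response/banned"] == stats["crawlera/response/status/503"]

# Assumption: the remainder are 503s that were not bans.
possibly_miscounted = (
    stats["crawlera/response/banned"] - stats["crawlera/response/error/banned"]
)
print(possibly_miscounted)  # 418
```

Under that assumption, 418 of the 433 responses counted toward maxbans may not have been bans at all.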

redapple commented 7 years ago

It would be interesting, but right now Scrapy does not report the status phrase: https://github.com/scrapy/scrapy/blob/f01ae6ffcd431b73f5358f9f876f8e9ee9be0113/scrapy/core/downloader/handlers/http11.py#L360. So the info accompanying the 503 is not available at the middleware level.

redapple commented 7 years ago

My bad. The X-Crawlera-Error header does have some information.

starrify commented 7 years ago

> so the info accompanying the 503 is not available at the middleware level.

IIRC it's available via the X-Crawlera-Error header
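
A minimal sketch of what a header-aware check could look like, assuming the ban is signalled by a 503 together with an X-Crawlera-Error header. `is_crawl_ban` and `BAN_ERRORS` are hypothetical names, not part of scrapy-crawlera. The accepted values are also an assumption: the errors table lists `slavebanned`, while the stats above show an `error/banned` counter, so both are accepted here. A plain dict stands in for the response headers, and values are treated as bytes because Scrapy's `response.headers.get()` returns bytes.

```python
BAN_CODE = 503

# Assumption: either value may indicate a ban (see the lead-in above).
BAN_ERRORS = {b"banned", b"slavebanned"}


def is_crawl_ban(status, headers):
    """Return True only for a 503 whose X-Crawlera-Error value indicates a ban.

    `headers` is any mapping with a .get() method; a plain dict is used here,
    so the key lookup is case-sensitive in this sketch.
    """
    if status != BAN_CODE:
        return False
    error = headers.get("X-Crawlera-Error") or b""
    if isinstance(error, str):
        # Tolerate str values so the helper also works with plain dicts of text.
        error = error.encode("ascii", "ignore")
    return error.strip().lower() in BAN_ERRORS


if __name__ == "__main__":
    print(is_crawl_ban(503, {"X-Crawlera-Error": b"banned"}))      # True
    print(is_crawl_ban(503, {"X-Crawlera-Error": b"serverbusy"}))  # False
    print(is_crawl_ban(503, {}))                                   # False
```

In the middleware snippet quoted earlier, a check like this could replace the bare `response.status == self.ban_code` comparison, so that "No available proxies" and "Server busy" responses would no longer count toward maxbans.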