scrapy-proxy-pool

Installation

::

pip install scrapy_proxy_pool

Usage

Enable this middleware by adding the following settings to your settings.py::

PROXY_POOL_ENABLED = True

Then add the scrapy_proxy_pool middlewares to your DOWNLOADER_MIDDLEWARES::

DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}

After this, all requests will be sent through proxies from the pool.

Requests with "proxy" set in their meta are not handled by scrapy-proxy-pool. To disable proxying for a request, set request.meta['proxy'] = None; to set a proxy explicitly, use request.meta['proxy'] = "<my-proxy-address>".
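
The opt-out rule can be pictured with a small, library-independent sketch. The helper name ``is_handled_by_pool`` is hypothetical and is not part of scrapy-proxy-pool; it only models the behaviour described above, assuming that the mere presence of a "proxy" key in meta, even with the value None, is what opts a request out.

```python
def is_handled_by_pool(meta):
    """Hypothetical helper: True if the pool would choose a proxy for this request."""
    # Any explicit "proxy" key, including one set to None, opts the request out.
    return "proxy" not in meta


plain = {}                                # no key: the pool picks a proxy
direct = {"proxy": None}                  # proxying disabled for this request
pinned = {"proxy": "<my-proxy-address>"}  # proxy chosen explicitly by the caller
```

In a spider, these dictionaries correspond to the ``meta`` argument passed when the request is created.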

Concurrency

By default, all standard Scrapy concurrency options (DOWNLOAD_DELAY, AUTOTHROTTLE_..., CONCURRENT_REQUESTS_PER_DOMAIN, etc.) become per-proxy for proxied requests when ProxyPoolMiddleware is enabled. For example, if you set CONCURRENT_REQUESTS_PER_DOMAIN=2, the spider will make at most 2 concurrent connections to each proxy, regardless of the request URL's domain.
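
As a rough illustration, a settings fragment along these lines would combine a per-proxy connection limit with a per-proxy delay. The values are placeholders chosen for the example, not recommendations from the project.

```python
# settings.py (illustrative values only)
PROXY_POOL_ENABLED = True

# With the middleware enabled, this limit applies per proxy
# rather than per target domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# The delay (in seconds) is likewise applied per proxy for proxied requests.
DOWNLOAD_DELAY = 1.0
```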

Customization

scrapy-proxy-pool keeps track of working and non-working proxies and re-checks them from time to time.

Detection of a non-working proxy is site-specific. By default, scrapy-proxy-pool uses a simple heuristic: if the response status code is not 200, 301, 302, 404 or 500, if the response body is empty, or if there was an exception, then the proxy is considered dead.
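
A literal reading of this heuristic can be sketched in plain Python. This is not the library's code: the function name ``looks_like_ban`` and its parameters are hypothetical, and the actual implementation may apply the empty-body check more narrowly.

```python
# Status codes that the default heuristic does not treat as a ban.
NOT_BAN_STATUSES = {200, 301, 302, 404, 500}


def looks_like_ban(status=None, body=b"", exception=None):
    """Hypothetical sketch of the documented default rule."""
    if exception is not None:
        # Any exception counts against the proxy.
        return True
    if status not in NOT_BAN_STATUSES:
        # Unexpected status code, e.g. 403 or 503.
        return True
    # An acceptable status with an empty body is still suspicious.
    return len(body) == 0
```

A site that signals bans differently, for example with a captcha page served as HTTP 200, would not be caught by this rule, which is why the detection method can be overridden.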

You can override the ban detection method by passing the path to a custom BanDetectionPolicy in the PROXY_POOL_BAN_POLICY option, e.g.::

# settings.py
PROXY_POOL_BAN_POLICY = 'myproject.policy.MyPolicy'

The policy must be a class with response_is_ban and exception_is_ban methods. These methods can return True (ban detected), False (not a ban) or None (unknown). It can be convenient to subclass and modify the default BanDetectionPolicy::

# myproject/policy.py
from scrapy_proxy_pool.policy import BanDetectionPolicy

class MyPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # use default rules, but also consider HTTP 200 responses
        # a ban if there is 'captcha' word in response body.
        ban = super().response_is_ban(request, response)
        ban = ban or b'captcha' in response.body
        return ban

    def exception_is_ban(self, request, exception):
        # override method completely: don't take exceptions into account
        return None

Instead of creating a policy, you can also implement response_is_ban and exception_is_ban as spider methods, for example::

import scrapy

class MySpider(scrapy.Spider):
    # ...

    def response_is_ban(self, request, response):
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        return None

It is important to get these rules right, because a failed request and a bad proxy call for different actions: if the proxy is to blame, it makes sense to retry the request with a different proxy.
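
The distinction can be sketched as a mapping from the three possible verdicts to an action. The function ``next_action`` and its action labels are hypothetical; only the retry-with-a-different-proxy case comes from the text above, and the handling of False and None is an assumption made for illustration.

```python
def next_action(ban):
    """Map a ban verdict (True, False or None) to a hypothetical action label."""
    if ban is True:
        # The proxy is to blame: stop using it and retry elsewhere.
        return "mark proxy dead, retry with a different proxy"
    if ban is False:
        # Assumed: the proxy worked, so the response is handled as usual.
        return "mark proxy good, process response normally"
    # Assumed: an unknown verdict leaves the proxy's status untouched.
    return "leave proxy status unchanged"
```

Comparing with ``is True`` and ``is False`` keeps None distinct from False, which a plain truthiness check would not.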

Settings