==============================================
Scrapy & JavaScript integration through Splash
==============================================

.. image:: https://img.shields.io/pypi/v/scrapy-splash.svg
   :target: https://pypi.python.org/pypi/scrapy-splash
   :alt: PyPI Version

.. image:: https://github.com/scrapy-plugins/scrapy-splash/workflows/Tests/badge.svg
   :target: https://github.com/scrapy-plugins/scrapy-splash/actions/workflows/tests.yml
   :alt: Test Status

.. image:: http://codecov.io/github/scrapy-plugins/scrapy-splash/coverage.svg?branch=master
   :target: http://codecov.io/github/scrapy-plugins/scrapy-splash?branch=master
   :alt: Code Coverage

This library provides Scrapy_ and JavaScript integration using Splash_.
The license is BSD 3-clause.

.. _Scrapy: https://github.com/scrapy/scrapy
.. _Splash: https://github.com/scrapinghub/splash

Installation
============

Install scrapy-splash using pip::

    $ pip install scrapy-splash

scrapy-splash uses the Splash_ HTTP API, so you also need a Splash instance. Usually, to install and run Splash, something like this is enough::

    $ docker run -p 8050:8050 scrapinghub/splash

Check the Splash `install docs`_ for more info.

.. _install docs: http://splash.readthedocs.org/en/latest/install.html
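
Before wiring Splash into Scrapy it can help to confirm that the instance is reachable. The following is a minimal sketch, assuming Splash listens on ``http://localhost:8050`` (the port published by the docker command above) and answers on its ``/_ping`` endpoint. ``splash_is_up`` and its ``opener`` parameter are hypothetical names used only for this illustration:

```python
import json
import urllib.request


def splash_is_up(base_url="http://localhost:8050", opener=urllib.request.urlopen):
    """Return True if the Splash instance at base_url answers /_ping with status "ok".

    base_url is an assumption; use the address of your own Splash instance.
    opener is injectable so the check can be exercised without a network.
    """
    try:
        with opener(base_url.rstrip("/") + "/_ping", timeout=5) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection problems (URLError is an OSError) and malformed
        # responses both mean the instance is not usable.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

Calling ``splash_is_up()`` right after starting the container gives a quick yes/no answer before running a spider.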

Configuration
=============

  1. Add the Splash server address to settings.py of your Scrapy project like this::

         SPLASH_URL = 'http://192.168.59.103:8050'

  2. Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file and changing the HttpCompressionMiddleware priority::

         DOWNLOADER_MIDDLEWARES = {
             'scrapy_splash.SplashCookiesMiddleware': 723,
             'scrapy_splash.SplashMiddleware': 725,
             'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
         }

     Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.

     HttpCompressionMiddleware priority should be changed in order to allow advanced response processing; see https://github.com/scrapy/scrapy/issues/1895 for details.

  3. Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py::

         SPIDER_MIDDLEWARES = {
             'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
         }

     This middleware is needed to support the cache_args feature; it saves disk space by not storing duplicate Splash arguments multiple times in a disk request queue. If Splash 2.1+ is used, the middleware also saves network traffic by not sending these duplicate arguments to the Splash server multiple times.

  4. Set a custom DUPEFILTER_CLASS::

         DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

  5. If you use the Scrapy HTTP cache then a custom cache storage backend is required. scrapy-splash provides a subclass of scrapy.extensions.httpcache.FilesystemCacheStorage::

         HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

     If you use another cache storage then it is necessary to subclass it and replace all scrapy.utils.request.request_fingerprint calls with scrapy_splash.splash_request_fingerprint.

.. note::

   Steps (4) and (5) are necessary because Scrapy doesn't provide a way
   to override the request fingerprint calculation algorithm globally; this
   could change in the future.

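There are also some additional options available, such as SPLASH_LOG_400 and the SPLASH_USER / SPLASH_PASS credentials described later in this document. Put them into your settings.py if you want to change the defaults.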

Usage
=====

Requests
--------

The easiest way to render requests with Splash is to use scrapy_splash.SplashRequest::

    yield SplashRequest(url, self.parse_result,
        args={
            # optional; parameters passed to Splash HTTP API
            'wait': 0.5,

            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        endpoint='render.json', # optional; default is render.html
        splash_url='<url>',     # optional; overrides SPLASH_URL
        slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
    )

Alternatively, you can use regular scrapy.Request and 'splash' Request meta key::

    yield scrapy.Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,

                # 'url' is prefilled from request url
                # 'http_method' is set to 'POST' for POST requests
                # 'body' is set to request body for POST requests
            },

            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',      # optional; overrides SPLASH_URL
            'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
            'splash_headers': {},       # optional; a dict with headers sent to Splash
            'dont_process_response': True, # optional, default is False
            'dont_send_headers': True,  # optional, default is False
            'magic_response': False,    # optional, default is True
        }
    })

Use the request.meta['splash'] API in middlewares or when scrapy.Request subclasses are used (there is also SplashFormRequest, described below). For example, meta['splash'] makes it possible to create a middleware which enables Splash for all outgoing requests by default.

SplashRequest is a convenient utility to fill request.meta['splash']; it should be easier to use in most cases. For each request.meta['splash'] key there is a corresponding SplashRequest keyword argument: for example, to set meta['splash']['args'] use SplashRequest(..., args=myargs).
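
As an illustration of the middleware idea mentioned above, here is a minimal sketch of a downloader middleware that fills request.meta['splash'] for every request that does not already have it. ``SplashByDefaultMiddleware``, its ``default_args`` and the ``dont_splash`` opt-out key are hypothetical names invented for this example; they are not part of scrapy-splash:

```python
class SplashByDefaultMiddleware:
    """Downloader middleware sketch: route every request through Splash.

    Hypothetical class for illustration; not shipped with scrapy-splash.
    """

    # Default rendering arguments; 'wait' mirrors the examples in this document.
    default_args = {"wait": 0.5}

    def process_request(self, request, spider):
        # Let individual requests opt out through a custom meta key
        # ('dont_splash' is a name made up for this sketch).
        if request.meta.get("dont_splash"):
            return None
        # Keep any Splash options the request already carries.
        request.meta.setdefault("splash", {"args": dict(self.default_args)})
        return None  # continue normal downloader processing
```

For this to take effect the middleware would have to run before the scrapy-splash middlewares, i.e. be registered in DOWNLOADER_MIDDLEWARES with an order below 723.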

.. _HTTP API docs: http://splash.readthedocs.org/en/latest/api.html

Use scrapy_splash.SplashFormRequest if you want to make a FormRequest via Splash. It accepts the same arguments as SplashRequest, and also formdata, like FormRequest from Scrapy::

    >>> SplashFormRequest('http://example.com', formdata={'foo': 'bar'})
    <POST http://example.com>

SplashFormRequest.from_response is also supported, and works as described in the `scrapy documentation <http://scrapy.readthedocs.org/en/latest/topics/request-response.html#scrapy.http.FormRequest.from_response>`_.

Responses
---------

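scrapy-splash returns Response subclasses for Splash requests: SplashResponse for binary results, SplashTextResponse for text results, and SplashJsonResponse for results that are JSON objects.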

To use standard Response classes set meta['splash']['dont_process_response']=True or pass dont_process_response=True argument to SplashRequest.

All these responses set response.url to the URL of the original request (i.e. to the URL of a website you want to render), not to the URL of the requested Splash endpoint. "True" URL is still available as response.real_url.

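SplashJsonResponse provides extra features: the decoded JSON is available as response.data and, when magic responses are enabled (the default), response attributes such as body, url, headers and status are filled from the corresponding keys of the JSON result.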

When response.body is updated in SplashJsonResponse (either from 'html' or from 'body' keys) familiar response.css and response.xpath methods are available.

To turn off special handling of JSON result keys either set meta['splash']['magic_response']=False or pass magic_response=False argument to SplashRequest.
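
The attributes described above can be read in a callback roughly as follows. This is a small sketch; ``summarize_splash_response`` is a hypothetical helper, and ``data`` is read defensively because it is only documented here for SplashJsonResponse:

```python
def summarize_splash_response(response):
    """Collect the Splash-specific response attributes into a plain dict."""
    return {
        # URL of the website that was rendered (the original request URL).
        "page_url": response.url,
        # URL of the Splash endpoint that was actually requested.
        "splash_url": response.real_url,
        # Keys of the decoded JSON result, or an empty list if there is none.
        "json_keys": sorted(getattr(response, "data", None) or {}),
    }
```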

Session Handling
----------------

Splash itself is stateless - each request starts from a clean state. In order to support sessions the following is required:

  1. client (Scrapy) must send current cookies to Splash;
  2. Splash script should make requests using these cookies and update them from HTTP response headers or JavaScript code;
  3. updated cookies should be sent back to the client;
  4. client should merge current cookies with the updated cookies.

For (2) and (3) Splash provides splash:get_cookies() and splash:init_cookies() methods which can be used in Splash Lua scripts.

scrapy-splash provides helpers for (1) and (4): to send current cookies in the 'cookies' field and merge cookies back from the 'cookies' response field, set request.meta['splash']['session_id'] to the session identifier. If you only want a single session, use the same session_id for all requests; any value like '1' or 'foo' is fine.

For scrapy-splash session handling to work you must use /execute endpoint and a Lua script which accepts 'cookies' argument and returns 'cookies' field in the result::

    function main(splash)
        splash:init_cookies(splash.args.cookies)

        -- ... your script

        return {
            cookies = splash:get_cookies(),
            -- ... other results, e.g. html
        }
    end

SplashRequest sets session_id automatically for /execute endpoint, i.e. cookie handling is enabled by default if you use SplashRequest, /execute endpoint and a compatible Lua rendering script.

If you want to start from the same set of cookies, but then 'fork' sessions set request.meta['splash']['new_session_id'] in addition to session_id. Request cookies will be fetched from cookiejar session_id, but response cookies will be merged back to the new_session_id cookiejar.

Standard Scrapy cookies argument can be used with SplashRequest to add cookies to the current Splash cookiejar.
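
When a plain scrapy.Request is used instead of SplashRequest, the session keys described above have to be set by hand. A minimal sketch of building such a meta dict follows; ``splash_session_meta`` is a hypothetical helper name, and the keys it sets are the ones documented in this section:

```python
def splash_session_meta(lua_source, session_id, new_session_id=None):
    """Build a request.meta dict that enables scrapy-splash session handling."""
    splash = {
        "endpoint": "execute",               # sessions require the /execute endpoint
        "args": {"lua_source": lua_source},  # script must accept and return 'cookies'
        "session_id": session_id,            # cookiejar that request cookies are read from
    }
    if new_session_id is not None:
        # Fork the session: response cookies are merged into this cookiejar.
        splash["new_session_id"] = new_session_id
    return {"splash": splash}
```

It could then be passed as ``scrapy.Request(url, callback, meta=splash_session_meta(script, '1'))``, where ``script`` is a Lua script like the one shown above.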

Examples
========

Get HTML contents::

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder; Scrapy requires a spider name
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # response.body is a result of render.html call; it
            # contains HTML processed by a browser.
            # ...
            pass

Get HTML contents and a screenshot::

    import base64
    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder name
        start_urls = ["http://example.com"]  # placeholder URL

        def start_requests(self):
            splash_args = {
                'html': 1,
                'png': 1,
                'width': 600,
                'render_all': 1,
            }
            for url in self.start_urls:
                yield SplashRequest(url, self.parse_result, endpoint='render.json',
                                    args=splash_args)

        def parse_result(self, response):
            # magic responses are turned ON by default,
            # so the result under 'html' key is available as response.body
            html = response.body

            # you can also query the html result as usual
            title = response.css('title').extract_first()

            # full decoded JSON data is available as response.data:
            png_bytes = base64.b64decode(response.data['png'])

            # ...

Run a simple `Splash Lua Script`_::

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder name
        start_urls = ["http://example.com"]  # placeholder URL

        def start_requests(self):
            script = """
            function main(splash)
                assert(splash:go(splash.args.url))
                return splash:evaljs("document.title")
            end
            """
            for url in self.start_urls:
                yield SplashRequest(url, self.parse_result, endpoint='execute',
                                    args={'lua_source': script})

        def parse_result(self, response):
            doc_title = response.text
            # ...

A more complex `Splash Lua Script`_ example - get a screenshot of an HTML element by its CSS selector (it requires Splash 2.1+). Note how arguments are passed to the script::

    import scrapy
    from scrapy_splash import SplashRequest

    script = """
    -- Arguments:
    -- * url - URL to render;
    -- * css - CSS selector to render;
    -- * pad - screenshot padding size.

    -- this function adds padding around region
    function pad(r, pad)
      return {r[1]-pad, r[2]-pad, r[3]+pad, r[4]+pad}
    end

    -- main script
    function main(splash)

      -- this function returns element bounding box
      local get_bbox = splash:jsfunc([[
        function(css) {
          var el = document.querySelector(css);
          var r = el.getBoundingClientRect();
          return [r.left, r.top, r.right, r.bottom];
        }
      ]])

      assert(splash:go(splash.args.url))
      assert(splash:wait(0.5))

      -- don't crop image by a viewport
      splash:set_viewport_full()

      local region = pad(get_bbox(splash.args.css), splash.args.pad)
      return splash:png{region=region}
    end
    """

    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder name
        start_urls = ["http://example.com"]  # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse_element_screenshot,
                    endpoint='execute',
                    args={
                        'lua_source': script,
                        'pad': 32,
                        'css': 'a.title'
                    }
                )

        def parse_element_screenshot(self, response):
            image_data = response.body  # binary image data in PNG format
            # ...

Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values; lua_source argument value is cached on Splash server and is not sent with each request (it requires Splash 2.1+)::

    import scrapy
    from scrapy_splash import SplashRequest

    script = """
    function main(splash)
      splash:init_cookies(splash.args.cookies)
      assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
        })
      assert(splash:wait(0.5))

      local entries = splash:history()
      local last_response = entries[#entries].response
      return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
      }
    end
    """

    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder name
        start_urls = ["http://example.com"]  # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse_result,
                    endpoint='execute',
                    cache_args=['lua_source'],
                    args={'lua_source': script},
                    headers={'X-My-Header': 'value'},
                )

        def parse_result(self, response):
            # here response.body contains result HTML;
            # response.headers are filled with headers from last
            # web page loaded to Splash;
            # cookies from all responses and from JavaScript are collected
            # and put into Set-Cookie response header, so that Scrapy
            # can remember them.
            pass
.. _Splash Lua Script: http://splash.readthedocs.org/en/latest/scripting-tutorial.html

HTTP Basic Auth
===============

If you need to use HTTP Basic Authentication to access Splash, use the SPLASH_USER and SPLASH_PASS optional settings::

    SPLASH_USER = 'user'
    SPLASH_PASS = 'userpass'

Another option is meta['splash']['splash_headers']: it lets you set custom headers which are sent to the Splash server. Add an Authorization header to splash_headers if you want to change credentials per request::

    import scrapy
    from scrapy_splash import SplashRequest
    from w3lib.http import basic_auth_header

    class MySpider(scrapy.Spider):
        # ...
        def start_requests(self):
            auth = basic_auth_header('user', 'userpass')
            for url in self.start_urls:
                yield SplashRequest(url, self.parse,
                                    splash_headers={'Authorization': auth})

WARNING: Don't use HttpAuthMiddleware_ (i.e. http_user / http_pass spider attributes) for Splash authentication: if you occasionally send a non-Splash request from your spider, you may expose Splash credentials to a remote website, as HttpAuthMiddleware sets credentials for all requests unconditionally.
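
For reference, the value of a Basic Authorization header is the word ``Basic`` followed by the base64-encoded ``user:password`` pair. The standard-library sketch below shows that format. ``basic_auth_value`` is a hypothetical helper written for illustration, and its ISO-8859-1 default is an assumption; in a spider, a ready-made helper such as w3lib's basic_auth_header is the simpler choice:

```python
import base64


def basic_auth_value(username, password, encoding="ISO-8859-1"):
    """Build an HTTP Basic Authorization header value as bytes.

    The default encoding is an assumption made for this sketch.
    """
    credentials = f"{username}:{password}".encode(encoding)
    return b"Basic " + base64.b64encode(credentials)
```

Because the credentials are only encoded, not encrypted, anyone who sees the header can recover them, which is why sending it to a non-Splash host is a leak.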

.. _HttpAuthMiddleware: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpauth

Why not use the Splash HTTP API directly?
=========================================

The obvious alternative to scrapy-splash would be to send requests directly to the Splash `HTTP API`_. Take a look at the example below and make sure to read the observations after it::

    import json

    import scrapy
    from scrapy.http.headers import Headers

    RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"

    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder; Scrapy requires a spider name
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            for url in self.start_urls:
                body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
                headers = Headers({'Content-Type': 'application/json'})
                yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                     body=body, headers=headers)

        def parse(self, response):
            # response.body is a result of render.html call; it
            # contains HTML processed by a browser.
            # ...
            pass

It works and is easy enough, but there are some issues that you should be aware of:

  1. There is a bit of boilerplate.

  2. As seen by Scrapy, we're sending requests to RENDER_HTML_URL instead of the target URLs. This affects concurrency and politeness settings: CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. could behave in unexpected ways since delays and concurrency settings are no longer per-domain.

  3. As seen by Scrapy, response.url is the URL of the Splash server. scrapy-splash fixes it to be the URL of the requested page; the "real" URL is still available as response.real_url. scrapy-splash also lets you handle response.status and response.headers transparently on the Scrapy side.

  4. Some options depend on each other - for example, if you use timeout_ Splash option then you may want to set download_timeout scrapy.Request meta key as well.

  5. It is easy to get it subtly wrong - e.g. if you don't use the sort_keys=True argument when preparing the JSON body, the binary POST body content could vary even if all keys and values are the same, which means the dupefilter and cache will work incorrectly.

  6. The default Scrapy duplication filter doesn't take Splash specifics into account. For example, if a URL is sent in a JSON POST request body, Scrapy will compute the request fingerprint without canonicalizing this URL.

  7. Splash Bad Request (HTTP 400) errors are hard to debug because by default response content is not displayed by Scrapy. SplashMiddleware logs content of HTTP 400 Splash responses by default (it can be turned off by setting SPLASH_LOG_400 = False option).

  8. Cookie handling is tedious to implement, and you can't use Scrapy built-in Cookie middleware to handle cookies when working with Splash.

  9. Large Splash arguments which don't change with every request (e.g. lua_source) may take a lot of space when saved to Scrapy disk request queues. scrapy-splash provides a way to store such static parameters only once.

  10. Splash 2.1+ provides a way to save network traffic by caching large static arguments on server, but it requires client support: client should send proper save_args and load_args values and handle HTTP 498 responses.

scrapy-splash utilities handle such edge cases and reduce the boilerplate.
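
The sort_keys pitfall from the list above is easy to reproduce with the standard library alone. The sketch below serializes two dicts that compare equal but were built in a different key order; the variable names are illustrative only:

```python
import json

# Same Splash arguments, built in a different key order.
a = {"url": "http://example.com", "wait": 0.5}
b = {"wait": 0.5, "url": "http://example.com"}
assert a == b  # equal as Python dicts

# Without sort_keys the serialized bodies follow insertion order and differ,
# so a byte-level fingerprint would treat them as two different requests.
plain_differs = json.dumps(a) != json.dumps(b)

# With sort_keys=True both bodies serialize to identical text.
sorted_matches = json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```

Here both ``plain_differs`` and ``sorted_matches`` end up true, which is the behaviour the dupefilter and cache remarks rely on.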

.. _HTTP API: http://splash.readthedocs.org/en/latest/api.html
.. _timeout: http://splash.readthedocs.org/en/latest/api.html#arg-timeout

Getting help
============

If a website is not rendered correctly, check the `Splash FAQ`_; for Scrapy-related bugs, see `reporting Scrapy bugs`_. For any other help, the best approach is to ask a question on `Stack Overflow`_.

.. _reporting Scrapy bugs: https://doc.scrapy.org/en/master/contributing.html#reporting-bugs
.. _Splash FAQ: http://splash.readthedocs.io/en/stable/faq.html#website-is-not-rendered-correctly
.. _Stack Overflow: https://stackoverflow.com/questions/tagged/scrapy-splash?sort=frequent&pageSize=15&mixed=1

Contributing
============

Source code and bug tracker are on GitHub: https://github.com/scrapy-plugins/scrapy-splash

To run tests, install "tox" Python package and then run tox command from the source checkout.

To run integration tests, start Splash and set SPLASH_URL env variable to Splash address before running tox command::

    docker run -d --rm -p8050:8050 scrapinghub/splash:3.0
    SPLASH_URL=http://127.0.0.1:8050 tox -e py36