scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration

render.png bug #138

IaroslavR closed this issue 4 years ago

IaroslavR commented 7 years ago

Python 3.6.2, Scrapy 1.4.0, scrapy_splash 0.7.2, splash from docker pull scrapinghub/splash
Spider:

import scrapy
from scrapy_splash import SplashRequest

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        splash_args = {
            'png': 1,
            'render_all': 1,
            'wait': 2,
        }
        url = 'https://google.com'
        yield SplashRequest(
            url,
            callback=self.parse_splash,
            endpoint='render.png',
            args=splash_args
        )
        yield scrapy.Request(
            f"http://localhost:8050/render.png?url={url}&wait=2&render_all=1",
            self.parse_request,
        )

    def parse_request(self, response):
        with open('request.png', 'wb') as f:
            f.write(response.body)

    def parse_splash(self, response):
        with open('splash.png', 'wb') as f:
            f.write(response.body)

With scrapy.Request everything works, but splash.png contains garbage instead of the Google page screenshot.

Gallaecio commented 5 years ago

Your example uses Scrapy-Splash without configuring it first. Please check the README.

Adding this to the spider makes it work:

class TestSpider(scrapy.Spider):

    # …

    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    # …
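
To tell a real screenshot from "garbage" before writing the file, the callback can check whether the response body starts with the PNG file signature. This is only a minimal sketch: `looks_like_png` is a hypothetical helper, not part of Scrapy or scrapy-splash. The assumption is that, without the middleware, the request never reaches Splash, so the body is something other than an image (for example HTML).

```python
# First 8 bytes of every valid PNG file, as defined by the PNG specification.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(body: bytes) -> bool:
    """Return True if `body` starts with the PNG file signature.

    Hypothetical helper for illustration; it checks only the signature,
    not whether the rest of the image data is intact.
    """
    return body.startswith(PNG_SIGNATURE)
```

In the spider above, `parse_splash` could call `looks_like_png(response.body)` and log a warning instead of writing the file when it returns `False`.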