scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License

504 Gateway Time-out #28

Closed. odmaaa closed this issue 8 years ago.

odmaaa commented 8 years ago

Hello, I am crawling a website with ~10K pages. At first all responses are 200 and everything is OK, but after a few minutes 504 Gateway Time-out responses appear, and after 3 retries Scrapy gives up. I set:

    'CONCURRENT_REQUESTS': 10,
    'HTTPCACHE_ENABLED': True,
    'DOWNLOAD_DELAY': 5,
    'CONCURRENT_REQUESTS_PER_IP': 10,

and the endpoint is render.html:

    'splash': {
        'endpoint': 'render.html',
        'args': {'wait': 1},
    }
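
For context, this 'splash' dict goes into the request meta. A minimal sketch of the full request, assuming the meta-based API from the scrapy-splash README (the spider name and URL are placeholders, and the SPLASH_URL setting plus the scrapy-splash middlewares are assumed to be configured as the README describes):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'                      # hypothetical spider
        start_urls = ['http://example.com']   # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                # route the request through Splash via the 'splash' meta key
                yield scrapy.Request(url, self.parse, meta={
                    'splash': {
                        'endpoint': 'render.html',
                        'args': {'wait': 1},
                    },
                })

        def parse(self, response):
            pass  # extraction logic goes here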

I am using: Scrapy 1.0.3, Python 2.7, and Splash running in a Docker server.

How can I optimize my crawler and avoid the 504 errors?

kmike commented 8 years ago

Hey @odmaaa,

Please check http://splash.readthedocs.org/en/stable/faq.html - does it help?

odmaaa commented 8 years ago

Hi @kmike ,

Yes, it helped, thank you. I disabled images and set the timeout to 720, and everything worked great.
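
For reference, a minimal sketch of what that fix can look like with SplashRequest (the actual spider code was not posted, so this is an assumption; note also that Splash caps the timeout argument server-side, so a value like 720 is only accepted if the container is started with a large enough limit, e.g. docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600):

    from scrapy_splash import SplashRequest

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 1,
                                      'images': 0,      # skip image downloads
                                      'timeout': 720})  # must stay under the server's --max-timeout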

kmike commented 8 years ago

@odmaaa glad to hear that!

yeszao commented 5 years ago

Following @odmaaa, I changed the args to:

    yield SplashRequest(url, self.parse, args={'wait': 0.5, 'viewport': '1024x2480', 'timeout': 90, 'images': 0})

It works!

Besides, some websites are very quick when you use curl or a browser but very slow in Splash, because Splash cannot download some resources correctly.

This can also show up as a 504 Gateway Time-out. The right fix is to stop the slow resource downloads: in Splash, you can set resource_timeout in args:

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,
                                self.parse,
                                args={'wait': 0.5, 'viewport': '1024x2480',
                                      'timeout': 90, 'images': 0,
                                      'resource_timeout': 10})
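
Note: per the Splash HTTP API docs, timeout bounds the render as a whole, while resource_timeout applies to each individual network request, so a single slow asset (a hung tracker script or image) is aborted after 10 seconds instead of stalling the whole page.
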
kamrankausar commented 3 years ago

Thanks @yeszao, it works!