scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Crawlera + Splash very slow #761

Open farazirfan47 opened 6 years ago

farazirfan47 commented 6 years ago

Hi, I have integrated Crawlera with Splash and now the response is really slow, even though I have increased the timeout limit. Please let me know how I can improve my request speed when using Crawlera as a proxy with Splash. Is it a good choice to use Crawlera with Splash?

landoncope commented 6 years ago

Have you looked into Crawlera throttling?

farazirfan47 commented 6 years ago

Yes, I disabled ads and some unessential resources, but it did not help much.

lopuhin commented 6 years ago

Hi @farazirfan47 getting good performance from Splash + Crawlera is tricky indeed. HAR (http://splash.readthedocs.io/en/stable/scripting-ref.html#splash-har) can help with diagnosing the issue. One problem that I have seen is that our current example script re-uses sessions, which adds a 12-second delay between subsequent requests when rendering one page. If this is indeed the issue for you (it will be clear from the HAR output), then there are two ways to solve it: (1) don't use sessions, or (2) make only the first request via Crawlera (the rest are usually static).
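Option (2) can be sketched as a Splash Lua script that routes only the initial page request through Crawlera and lets subsequent resource requests go direct. This is a minimal sketch, not the official example: the `proxy.crawlera.com:8010` endpoint and the `crawlera_apikey` argument name are assumptions you would adapt to your setup.

```lua
function main(splash, args)
  -- Assumed Crawlera endpoint; the API key is assumed to be passed
  -- by the client as args.crawlera_apikey.
  local crawlera_host = 'proxy.crawlera.com'
  local api_key = args.crawlera_apikey
  local first_request = true

  splash:on_request(function(request)
    if first_request then
      -- Route only the first request (the page itself) through Crawlera;
      -- the remaining resource requests are usually static and go direct.
      request:set_proxy{
        host = crawlera_host,
        port = 8010,
        username = api_key,
        password = '',
      }
      first_request = false
    end
  end)

  assert(splash:go(args.url))
  splash:wait(1.0)
  return {html = splash:html()}
end
```

Because no session header is set, this also avoids the per-request session delay mentioned above.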

farazirfan47 commented 6 years ago

I have analysed the HAR output and it clearly shows that some of the web page resources take too long. Can I stop Splash from using Crawlera when it downloads web page resources?

lopuhin commented 6 years ago

> Can I stop Splash from using Crawlera when it downloads web page resources?

@farazirfan47 yes, please see this example (code is commented out): https://github.com/scrapinghub/sample-projects/blob/0a9779cac4564d24c082e4973534f36f33eb75d3/splash_crawlera_example/splash_crawlera_example/scripts/crawlera.lua#L18-L31 - this is from the guide https://support.scrapinghub.com/support/solutions/articles/22000188428-using-crawlera-with-splash
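The commented-out code in the linked example works by inspecting each request URL in `splash:on_request` and aborting the ones you don't need. A minimal sketch of that idea (the blocklist patterns here are purely illustrative, not from the linked script):

```lua
function main(splash, args)
  -- Hypothetical blocklist of Lua patterns for unessential resources;
  -- you would extend this per site.
  local blocked = {
    'doubleclick%.net',
    'google%-analytics%.com',
    '%.woff',
    '%.mp4',
  }

  splash:on_request(function(request)
    for _, pattern in ipairs(blocked) do
      if string.find(request.url, pattern) then
        -- Skip ads, analytics, fonts, media entirely.
        request:abort()
        return
      end
    end
  end)

  assert(splash:go(args.url))
  splash:wait(0.5)
  return splash:html()
end
```

Aborting matched requests means they never hit the network (or Crawlera), which is usually a bigger win than merely bypassing the proxy for them.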

farazirfan47 commented 6 years ago

I tried disabling the unessential resources, but performance is still not good. First of all, it's hard to find the unessential resource links and then apply the filter. I am dealing with 40+ sites, and I have to write separate rules for each of them, which is a time-consuming task.

lopuhin commented 6 years ago

@farazirfan47 are resources for one page downloaded in parallel or sequentially?

paunovic commented 6 years ago

Bump. Has anyone figured out a good solution for this? I've integrated Crawlera + Splash, but it's incredibly slow; it takes more than a few minutes to load a web page. I've limited concurrent requests to 10 in Scrapy and Splash due to the Crawlera basic plan limits.