Can one ScrapySplash server handle multiple SplashRequests from multiple Spiders?

scrapinghub / splash

Can one ScrapySplash server handle multiple SplashRequests from multiple Spiders? #1098

Open fritz-0 opened 3 years ago

fritz-0 commented 3 years ago

I have a Spider that crawls multiple URLs (100+), and I plan to use both Requests and SplashRequests because the website needs JavaScript rendering. I am also thinking of running several Spiders simultaneously across those URLs.

Please note that my Spider chains a long series of callbacks.

Is ScrapySplash capable of handling the simultaneous requests that can come from the multiple Spiders?

That is, even if there is only one Scrapy-Splash instance running (via Docker) on localhost:8050? If yes, how can I implement this, or are there better alternatives?

lopuhin commented 3 years ago

> Is ScrapySplash capable of handling the simultaneous requests that can come from the multiple Spiders?

You mean just "splash", not "scrapy-splash" library, right?

If yes, then it would work, but there is a limit on how many requests a single Splash instance can handle in parallel. IIRC usually 10 requests max are processed and the rest are queued, so if you throw too many requests at it at once you'll get errors from Splash. If you face this problem, there are two possible solutions: reduce concurrency on the spider side, or deploy more Splash processes, e.g. with https://github.com/TeamHG-Memex/aquarium/ or a hosted Splash service that already handles this.
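The first option (reducing concurrency on the spider side) could be sketched as a `settings.py` fragment like the one below. This is only an illustration under stated assumptions: the parallelism figure is the "10" recalled from memory above, the number of spider processes is made up, and `per_spider_concurrency` is a hypothetical helper rather than part of Scrapy or scrapy-splash. `SPLASH_URL` and `CONCURRENT_REQUESTS` are real setting names.

```python
# settings.py fragment (sketch). The numbers are assumptions, not measured limits.

SPLASH_URL = "http://localhost:8050"  # the single Splash instance from the question

# Assumed number of renders one Splash instance processes in parallel
# (the "10" quoted from memory in the thread; verify for your deployment).
ASSUMED_SPLASH_PARALLELISM = 10

# Hypothetical: how many spider processes share that one instance.
SPIDER_PROCESSES = 4


def per_spider_concurrency(splash_parallelism: int, spider_processes: int) -> int:
    """Split the Splash budget evenly, never going below one request per spider."""
    return max(1, splash_parallelism // spider_processes)


# Scrapy's global cap on in-flight requests for this spider process.
# It also limits plain Requests, so it is a conservative bound.
CONCURRENT_REQUESTS = per_spider_concurrency(ASSUMED_SPLASH_PARALLELISM, SPIDER_PROCESSES)
```

Scrapy only reads upper-case names from a settings module, so the helper function itself is ignored when the settings are loaded. Because each spider process enforces its own cap, the total load on Splash is roughly the per-spider value multiplied by the number of processes.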

fritz-0 commented 3 years ago

@lopuhin Actually, I was referring to "scrapy-splash"; I thought these were the same. Apologies. Anyway, thank you for clarifying how concurrency works in "Splash", and that there is still a limit, especially when there are too many requests.

Given what you said about too many requests, I am now looking into running a separate Docker + Scrapy-Splash instance per Spider.
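If several Splash containers end up running side by side, one possible way to spread requests over them is a simple round-robin, sketched below. The three endpoints and the `next_splash_url` helper are hypothetical; the only scrapy-splash feature assumed is the `splash_url` argument of `SplashRequest`, which overrides the `SPLASH_URL` setting for a single request.

```python
from itertools import cycle

# Hypothetical endpoints: one Splash container published per port.
SPLASH_URLS = [
    "http://localhost:8050",
    "http://localhost:8051",
    "http://localhost:8052",
]

# cycle() repeats the list forever, which gives a round-robin rotation.
_rotation = cycle(SPLASH_URLS)


def next_splash_url() -> str:
    """Return the next Splash endpoint in the rotation."""
    return next(_rotation)


# Inside a spider callback (assuming scrapy-splash is installed):
#   yield SplashRequest(url, self.parse_page,
#                       splash_url=next_splash_url(),
#                       args={"wait": 0.5})
```

A load balancer in front of the containers (which is what aquarium sets up) achieves the same effect without any change to the spider code.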

Gallaecio commented 3 years ago

> Actually, I was referring to "scrapy-splash"

I really think you meant Splash, that’s what runs in Docker. But it’s common to mix it up with scrapy-splash, which is just a Scrapy plugin/client for Splash.

lopuhin commented 3 years ago

As @Gallaecio said, scrapy-splash is a scrapy client for splash, and does not impose any extra restrictions on the number of concurrent requests - but splash considerations still apply.
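To make the client/server split concrete: Splash is a plain HTTP service, so any client, scrapy-splash included, ultimately sends requests to an endpoint such as `render.html`. The sketch below only builds such a URL with the standard library; `render_html_url` is a hypothetical helper, and the target page and `wait` value are example inputs.

```python
from urllib.parse import urlencode


def render_html_url(splash_base: str, page_url: str, wait: float = 0.5) -> str:
    """Build a Splash render.html URL for the given page (no request is sent)."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


# Example input: the Splash instance from the question and a placeholder page.
example = render_html_url("http://localhost:8050", "https://example.com/a?b=1")
```

Every spider that points at the same base URL shares the same Splash process, which is why the parallelism limit applies to all of them together.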