scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License
4.09k stars, 513 forks

Performance compared to PhantomJS #99

Open andr0s opened 10 years ago

andr0s commented 10 years ago

Hi,

Did you run any benchmarks? How does Splash compare to, say, PhantomJS, particularly in CPU and memory consumption?

I'm asking because running more than 100 parallel PhantomJS instances efficiently is almost impossible, even on good hardware.

Thanks!

ericvalente commented 9 years ago

Hi andr0s,

Did you ever get any feedback on this? I'm currently migrating my Scrapy spiders from selenium/phantomjs to splash and I'm wondering if it is worth the effort.

andr0s commented 9 years ago

@ericvalente from what I understood, the main bottleneck is WebKit, and there isn't much we can do about that. It's probably not worth it.

kmike commented 9 years ago

Hi @ericvalente and @andr0s,

I haven't run any comprehensive tests, but performance shouldn't be that different. PhantomJS 1.x and current Splash use the same WebKit version so webpage rendering should have the same speed.

Splash can be faster in some cases for two reasons:

  1. there is no startup overhead;
  2. there is in-memory cache shared between requests.

To get the most out of Splash, it is better to start several instances (one per CPU core?) and load balance between them, because a single Splash instance can saturate only one CPU core.
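As a rough illustration of the "several instances, spread the load" idea, here is a minimal client-side sketch. It assumes one Splash instance per core listening on consecutive ports starting at 8050 on localhost; the port layout and the helper names `instance_urls` and `render_url_factory` are hypothetical. Only the `render.html` endpoint with its `url` and `wait` arguments comes from Splash itself.

```python
import itertools
import os
from urllib.parse import urlencode


def instance_urls(n_instances=None, host="localhost", base_port=8050):
    """Base URLs for n Splash instances (assumed layout: consecutive ports).

    Defaults to one instance per CPU core, as suggested above.
    """
    n = n_instances or os.cpu_count() or 1
    return [f"http://{host}:{base_port + i}" for i in range(n)]


def render_url_factory(bases):
    """Return a function that builds render.html URLs, rotating over instances."""
    rotation = itertools.cycle(bases)

    def render_url(page_url, wait=0.5):
        query = urlencode({"url": page_url, "wait": wait})
        return f"{next(rotation)}/render.html?{query}"

    return render_url


if __name__ == "__main__":
    render_url = render_url_factory(instance_urls(2))
    for _ in range(3):
        print(render_url("http://example.com"))
```

Plain round-robin like this spreads requests evenly, but it knows nothing about how busy each instance is, so it is only a starting point compared to a real load balancer.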

PhantomJS 2.x (not released yet) may be faster because it uses a more modern WebKit version. We also have plans to switch to a more modern WebKit, but it is not implemented yet, and there is no schedule for this feature.

ericvalente commented 9 years ago

This is very helpful, thank you @kmike .

kmike commented 9 years ago

You can also use Adblock Plus filters to speed up rendering: Splash supports https://easylist.adblockplus.org/en/ filters, which helps when ads and social widgets are of no interest.
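For reference, a small sketch of how a filtered render request might be built. It assumes Splash was started with the `--filters-path` option pointing at a directory that contains `easylist.txt`, so that the filter is addressable by the name `easylist`; the helper name `filtered_render_url` and the localhost address are hypothetical.

```python
from urllib.parse import urlencode


def filtered_render_url(page_url, filter_names, splash="http://localhost:8050"):
    """Build a render.html URL that asks Splash to apply the named filters.

    filter_names are assumed to match filter files available to Splash
    (for example "easylist" for an easylist.txt file).
    """
    query = urlencode({"url": page_url, "filters": ",".join(filter_names)})
    return f"{splash}/render.html?{query}"


if __name__ == "__main__":
    print(filtered_render_url("http://example.com", ["easylist"]))
```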

AlexIzydorczyk commented 9 years ago

Hi @kmike,

Any examples of a load balancer for Splash or plans to release one?

kmike commented 9 years ago

Hi @AlexIzydorczyk,

The idea is to configure either Nginx or HAProxy. There are plans to add example config files to the Splash repo.

The tricky part is that to load balance Splash robustly you need to create a queue in the load balancer itself, and limit the number of parallel rendering requests sent to a single Splash instance to the number of Splash slots (set at startup using the --slots option). Plus the usual stuff like failover. This way you won't lose any requests when a single Splash instance is restarting, and you won't overload instances to the point where they return 504 errors for most requests.

HAProxy should have all the necessary features. Nginx can do it only in its commercial version, using the max_conns parameter. Maybe we'll create a custom Lua Nginx module to do that.
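To make the queue-plus-slot-limit idea concrete, here is a minimal client-side sketch in Python using asyncio. It is not an HAProxy or Nginx configuration, and it leaves out failover. The class name `SlotPool`, the instance URLs, and the `asyncio.sleep` call standing in for the real HTTP request are all assumptions for illustration. Each instance contributes as many tokens as it has slots; a request waits in the queue until a token is free, so no instance ever receives more parallel requests than its slot count.

```python
import asyncio
from contextlib import asynccontextmanager


class SlotPool:
    """Hands out at most `slots` concurrent leases per Splash instance."""

    def __init__(self, instances, slots):
        # One token per free slot; tokens are interleaved so load spreads out.
        self._free = asyncio.Queue()
        for _ in range(slots):
            for base in instances:
                self._free.put_nowait(base)

    @asynccontextmanager
    async def slot(self):
        base = await self._free.get()  # waits here while every slot is busy
        try:
            yield base
        finally:
            self._free.put_nowait(base)  # give the slot back, even on error


async def demo(n_requests=10, slots=2):
    # Create the pool inside the running event loop.
    pool = SlotPool(["http://splash-a:8050", "http://splash-b:8050"], slots)
    in_flight, peak = {}, {}

    async def render(_):
        async with pool.slot() as base:
            in_flight[base] = in_flight.get(base, 0) + 1
            peak[base] = max(peak.get(base, 0), in_flight[base])
            await asyncio.sleep(0.01)  # stand-in for the HTTP call to Splash
            in_flight[base] -= 1

    await asyncio.gather(*(render(i) for i in range(n_requests)))
    return peak  # highest concurrency observed per instance


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A real load balancer does the same bookkeeping outside the client, which also lets several crawlers share the same limits.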

kmike commented 9 years ago

> PhantomJS 2.x (not released yet) may be faster because it uses a more modern WebKit version. We also have plans to switch to a more modern WebKit, but it is not implemented yet, and there is no schedule for this feature.

An update: there is a qt5 branch which uses a more modern WebKit (released in 2013), the same engine as PhantomJS 2.x. It passes all tests, but lacks e.g. an updated Dockerfile; the idea is to make one more qt4-based stable release before merging this branch.

AlexIzydorczyk commented 9 years ago

Hi @kmike

Thanks - this is very helpful.

As a temporary solution in the meantime, I was thinking of setting up Splash instances in Docker containers, each confined to a different single core (from what I've read, Splash can only use one core anyway). Given that all the pages I'm rendering are exactly the same, I figure I can then just get Scrapy to send requests to each of these instances in equal amounts. At the moment, with a single Scrapy instance, I'm getting HTTP error 503, which I assume means I'm overloading the instance...
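A sketch of what that setup might look like, generating one `docker run` command per core. It assumes the `scrapinghub/splash` image serving on container port 8050 and uses Docker's `--cpuset-cpus` flag to pin each container to one core; the host port numbering and the helper name `docker_commands` are hypothetical.

```python
def docker_commands(n_cores, base_port=8050, image="scrapinghub/splash"):
    """One `docker run` command per core, each pinned with --cpuset-cpus.

    Host ports are assumed to be base_port, base_port + 1, and so on.
    """
    return [
        f"docker run -d --cpuset-cpus={core} -p {base_port + core}:8050 {image}"
        for core in range(n_cores)
    ]


if __name__ == "__main__":
    for command in docker_commands(4):
        print(command)
```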

kmike commented 9 years ago

If you have a single Scrapy instance you can throttle requests on the Scrapy side. If you use https://github.com/scrapinghub/scrapyjs, set slot_policy to scrapyjs.SlotPolicy.SINGLE_SLOT and reduce CONCURRENT_REQUESTS_PER_DOMAIN until there are no timeout errors.
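A minimal sketch of that throttling, written as plain helpers so it runs without Scrapy installed. The layout of the `splash` request meta (`endpoint`, `args`, `slot_policy`) is an assumption based on the scrapyjs README and should be checked against the version in use; the helper names `throttle_settings` and `splash_meta` are hypothetical.

```python
def throttle_settings(max_parallel):
    """Scrapy settings fragment capping parallel requests per downloader slot."""
    return {"CONCURRENT_REQUESTS_PER_DOMAIN": max_parallel}


def splash_meta(slot_policy, wait=0.5):
    """Request meta for scrapyjs (assumed layout; verify against its README).

    Pass scrapyjs.SlotPolicy.SINGLE_SLOT as slot_policy so that all Splash
    requests share one slot and the per-domain limit caps them as a group.
    """
    return {
        "splash": {
            "endpoint": "render.html",
            "args": {"wait": wait},
            "slot_policy": slot_policy,
        }
    }


# Inside a spider this would be used roughly as:
#   custom_settings = throttle_settings(4)
#   yield scrapy.Request(url, meta=splash_meta(scrapyjs.SlotPolicy.SINGLE_SLOT))
```

Start with a value close to the number of Splash slots and lower it while timeouts persist.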

tamoyal commented 8 years ago

@kmike You mentioned running one instance per core is the recommended setup but installation with docker doesn't set this up. I wanted to follow up because it's been about a year since this issue has been commented on. Is this still the recommended setup?

kmike commented 8 years ago

Hey @tamoyal! Yes, running Splash behind HAProxy is still a recommended setup. Check https://github.com/TeamHG-Memex/aquarium for an example; probably it is possible to do something even better with docker-swarm.

jkryanchou commented 7 years ago

👍