scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Add a composable way to augment splash (addons?) #283

Open dvdbng opened 9 years ago

dvdbng commented 9 years ago

When trying to augment splash functionality from a scrapy middleware I found two problems:

Composability: When using scrapy middlewares, if two different middlewares want to extend splash by using lua code and the /execute endpoint, their scripts will overwrite each other.

Verbosity: If you want to use a lua or content script you need to send it again for every request even if it's the same.

Here are a few solutions I can think of:

  1. Keep splash like it is, provide some common libraries and modules in a separate repository.

    The splash users can download them if they want and they would be pre-installed in the hosted splash version.

    This would solve the verbosity problem and make the composability one a bit easier, since most of the code could be sent as a lua module that is loaded in the main script.

  2. Allow "saving" and loading of splash scripts and content scripts.

    You could send the same script once to splash and splash would cache it, returning a reference to load it again.

    This would cause problems when splash is behind a load balancer. I initially liked this option, but the load balancer problem would be too complicated to get around.

  3. Allow passing an array with several scripts as the lua_script parameter

    The first scripts would do whatever they needed and typically the last one would load the page and return the response data.

    This solves the composability problem but not the verbosity one.

  4. Have "splash" addons. Addons would be lua files that would add some features. You select what addons you want to use by sending the names in a parameter.

    This solves both the composability problem and the verbosity problem, but we still need to provide a library of addons.

    There are two sub-options.

    • 4a: Addons only work in the /execute endpoint. This is my favourite option; it solves the problem without being excessively complicated.
    • 4b: Addons work in every endpoint. I suspect this would be a pretty large rewrite of splash.

Some Ideas for addons/modules:

kmike commented 9 years ago

(1) Keep splash like it is, provide some common libraries and modules in a separate repository.

Sorry, could you please elaborate on how it is different from 4a?

(2) Allow "saving" and loading of splash scripts and content scripts.

This can be done in a way similar to redis EVAL / EVALSHA (http://redis.io/commands/EVAL): you send a script and get a hash back, then you can use this hash to preload the script. Splash checks whether the hash exists, and if the script is not found it tells the client to send the script again.

It is a solution to save network traffic without requiring access to the server.
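A minimal sketch of that EVALSHA-style flow, for illustration only. Splash has no such cache: `ScriptCache` and `run_by_hash` are hypothetical names, and the cache is an in-memory stand-in for a single Splash instance. The SHA-1 digest mirrors what Redis uses for EVALSHA.

```python
import hashlib


class ScriptCache:
    """In-memory stand-in for one Splash instance's script cache (hypothetical)."""

    def __init__(self):
        self._scripts = {}

    def load(self, script):
        # Store the script and return its SHA-1 hex digest.
        digest = hashlib.sha1(script.encode("utf-8")).hexdigest()
        self._scripts[digest] = script
        return digest

    def get(self, digest):
        # Return the cached script, or None if this instance never saw it.
        return self._scripts.get(digest)


def run_by_hash(cache, digest, script):
    """Try the hash first; resend the full script on a miss.

    Returns (script_to_run, cache_hit).
    """
    cached = cache.get(digest)
    if cached is None:
        # Unknown hash (for example, another node behind a load balancer):
        # fall back to sending the script body, which also warms the cache.
        cache.load(script)
        return script, False
    return cached, True
```

Because a miss is recoverable by resending the script, each instance can keep its own cache and no shared storage is needed.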

(3) Allow passing an array with several scripts as the lua_script parameter

Isn't it the same as passing "\n".join(scripts)? If so, it can be a client-side solution.

(4) Have "splash" addons. Addons would be lua files that would add some features. You select what addons you want to use by sending the names in a parameter.

Currently you can create such "addons" as Lua modules (see http://splash.readthedocs.org/en/latest/scripting-tutorial.html#custom-lua-modules). They only work in the /execute endpoint. Instead of being auto-loaded, they must be required explicitly in a script. A module can provide a set of useful functions (like crawlera.enable()) or just execute some code (e.g. add a splash:on_request handler) as an import side effect. Is that what you want in 4a?

AlexIzydorczyk commented 9 years ago

@kmike, just curious, when you use the /execute endpoint, do you typically create lua modules for your scripts? Or do you pass the script as a very long URL arg to Splash? I imagine the former is a better practice?

kmike commented 9 years ago

@AlexIzydorczyk at the moment I'm passing very long URLs, for deployment reasons. Server-side scripts are a better practice, but they are harder to use if you have a shared Splash cluster for various tasks.

AlexIzydorczyk commented 9 years ago

@kmike got it, that's where I'm at too. I have multiple jobs running on a Splash cluster and just have every individual request bring its own script. Was just curious how you approached it.

The HAproxy config file was extremely helpful by the way, works very well.

kmike commented 9 years ago

The HAproxy config file was extremely helpful by the way, works very well.

Great! See also: https://github.com/TeamHG-Memex/aquarium (it is a work in progress though).

AlexIzydorczyk commented 9 years ago

@kmike - very cool, interestingly enough, I've been working on something similar myself. Many of these features don't fit into Splash itself but would work for this.

On a different note, who or what is TeamHG? I use a lot of TeamHG things (mostly found through the DARPA open source Memex page) that are very useful...

kmike commented 9 years ago

@AlexIzydorczyk Feel free to contribute :) TeamHG means "Team Hyperion Gray"; see http://blog.scrapinghub.com/2015/02/24/memex/

@Youwotma sorry for having a chat here! I guess the question is who can extend Splash with addons. If it is a user who runs a Splash instance, then we should make it easier to use server-side Lua modules (suggestions are welcome, my take on it was the 'Custom Lua Modules' feature). If it is a user of a Splash instance, we should think of a way to make it easier to pass scripts (suggestions are welcome).

What I don't want to do is to make Splash stateful. An API where a user can submit a script via HTTP and expect it to be persisted forever doesn't sound good to me because it requires centralized storage.

kmike commented 9 years ago

@Youwotma see also: https://github.com/scrapinghub/splash/issues/143.

dvdbng commented 9 years ago

(1) Keep splash like it is, provide some common libraries and modules in a separate repository.

Sorry, could you please elaborate on how it is different from 4a?

In option 1 you still need to require the module in the lua script. In option 4a you pass a list of modules that will be loaded and do something without you having to call them explicitly. With option 1, if two different scrapy middlewares want to load different modules, they either overwrite each other or need to use really ugly regular expressions to add code to the previous script.

(2) Allow "saving" and loading of splash scripts and content scripts.

This can be done in a way similar to redis EVAL / EVALSHA (http://redis.io/commands/EVAL): you send a script and get a hash back, then you can use this hash to preload the script. Splash checks whether the hash exists, and if the script is not found it tells the client to send the script again.

It is a solution to save network traffic without requiring access to the server.

Yes, that was my idea for implementing it too.

(3) Allow passing an array with several scripts as the lua_script parameter

Isn't it the same as passing "\n".join(scripts)? If so, it can be a client-side solution.

Yes, but different scripts would have different main methods. That way different scrapy middlewares could extend splash without overwriting each other (I keep coming back to the scrapy middlewares example because that is where I noticed the problem).
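To make the difference concrete: if several scripts that each define `main` are simply joined with "\n", the last definition replaces the earlier ones. A rough client-side approximation is sketched below, assuming the middlewares agree on a convention where each contributes only a setup snippet (Lua statements that use `splash`). `compose_script` is a hypothetical helper, not part of Splash or scrapy, and this is not the server-side array of scripts proposed in option 3.

```python
def compose_script(setup_snippets, main_body):
    """Wrap each snippet in its own local Lua function and call them all
    from a single main(), so that no contribution overwrites another."""
    parts = []
    calls = []
    for index, snippet in enumerate(setup_snippets):
        name = "setup_%d" % index
        # Each snippet gets its own function scope and receives `splash`.
        parts.append("local function %s(splash)\n%s\nend" % (name, snippet))
        calls.append("  %s(splash)" % name)
    # Only one global main() is emitted; it runs the setups in order first.
    main_lines = ["function main(splash)"] + calls + [main_body, "end"]
    parts.append("\n".join(main_lines))
    return "\n\n".join(parts)
```

For example, one middleware could contribute `splash:set_user_agent("example-bot")` and another `splash.images_enabled = false`, while the final body loads the page and returns `splash:html()`.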

(4) Have "splash" addons. Addons would be lua files that would add some features. You select what addons you want to use by sending the names in a parameter.

Currently you can create such "addons" as Lua modules (see http://splash.readthedocs.org/en/latest/scripting-tutorial.html#custom-lua-modules). They only work in the /execute endpoint. Instead of being auto-loaded, they must be required explicitly in a script. A module can provide a set of useful functions (like crawlera.enable()) or just execute some code (e.g. add a splash:on_request handler) as an import side effect. Is that what you want in 4a?

That would be option 1; option 4a is auto-loading them so that you don't need to modify the existing lua_script.
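Option 4a as described would be a server-side feature that Splash does not have. As a rough client-side approximation, a helper can prepend `require` calls for a list of addon names, so each middleware only appends a name instead of editing the script text. This sketch assumes the modules are already installed on the server as custom Lua modules and do their work as an import side effect; `with_addons` is a hypothetical name.

```python
import re

# Conservative pattern for Lua module names such as "crawlera" or "utils.adblock".
_MODULE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")


def with_addons(lua_source, addons):
    """Return lua_source with one require() line per addon name prepended."""
    seen = []
    for name in addons:
        if not _MODULE_NAME.match(name):
            # Reject anything that could inject Lua code through the name.
            raise ValueError("invalid module name: %r" % name)
        if name not in seen:
            # Keep first-seen order and skip duplicates from other middlewares.
            seen.append(name)
    requires = "".join('require("%s")\n' % name for name in seen)
    return requires + lua_source
```

The script itself is still sent with every request, so this only addresses composability, not verbosity.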

dvdbng commented 8 years ago

Adding another possibility to the list: how about allowing resources (like adblock lists, javascript profiles, lua scripts, lua modules and proxy profiles) to be loaded by URL? It sounds like it would be easy to implement and useful: you could use a public script on rawgit, the update URL of an adblock list, a javascript file on a public CDN, or upload to s3 and pre-sign a URL.

Since this opens an attack vector (loading code from the internet), we could hardcode the hash of the file in the URL and make splash check it, like this: https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js#md5=4a356126b9573eb7bd1e9a7494737410 This would also make caching easier, since you only need to use the hash as the cache key and don't need to follow all the rules of the cache-*, vary, expires, etc. headers.
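A minimal sketch of the check, assuming the `#md5=` fragment format shown above. `split_pinned_url` and `verify_body` are hypothetical helpers, and downloading the resource is left out.

```python
import hashlib
from urllib.parse import parse_qs, urldefrag


def split_pinned_url(url):
    """Split 'https://host/file.js#md5=<hex>' into (download_url, expected_md5)."""
    base, fragment = urldefrag(url)
    md5_values = parse_qs(fragment).get("md5")
    if not md5_values:
        raise ValueError("URL has no md5 pin in its fragment")
    return base, md5_values[0].lower()


def verify_body(body, expected_md5):
    """Return True if the downloaded bytes match the pinned MD5 digest."""
    return hashlib.md5(body).hexdigest() == expected_md5
```

The digest returned by `split_pinned_url` can double as the cache key. MD5 is used here only because the example URL uses it; since MD5 is not collision-resistant, a stronger digest such as SHA-256 would be a safer choice for a real integrity check.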

kmike commented 8 years ago

@Youwotma

dvdbng commented 8 years ago

@kmike

By lua scripts I meant what you pass in lua_source; I'm proposing an alternative lua_url parameter for those. By lua modules I meant the things you require; for those I'm proposing a require_url lua function.