skruger / Surrogate

Proxy server written in erlang. Supports reverse proxy load balancing and forward proxy with http (including CONNECT), socks4, socks5, and transparent proxy modes.

Feature Request: caching #8

Open ghost opened 12 years ago

ghost commented 12 years ago

I've recently been looking into Erlang as a reverse proxy server which supports Edge Side Includes. Existing reverse proxy servers have limited to no support, and they're very dense to make code changes to. Whereas Erlang is designed from the ground up for the event-driven life of a proxy server, and seems easy to make small hacks on.

It looks like stream filtering is a good solution for Edge Side Includes. I was wondering if there is currently any caching built in or would I need to add that as well?

skruger commented 12 years ago

Caching has been kind of a sticking point for me on surrogate. While the edge side include thing is definitely a good use of stream filters I have my doubts that my current stream filter implementation is equal to the task of doing really good caching.

The problem is that I am not really sure how to structure the code. While it is possible to write a stream filter module that handles caching by instantly crafting a response and sending it out the client socket, I worry that it would become clunky and buggy very fast. The good news is that if it is an addon module then it can be disabled if it misbehaves.

The other approach that I have thought about is modifying the proxy_pass FSM that handles all of the request states. That is the place where I will probably need to make the most changes for handling caching. This might be a good thing to do since changes also need to be made for enforcing http hop by hop headers.

Right now a client request is read and processed, then the surrogate-to-web-server request is formed and sent directly out a server socket. If I were to write an abstraction that can handle caching when needed, I could pass the full request to that module and it could decide to open a socket and send the request along or pull from a cache. When returning the response from either the cache or from a server socket, I could send it back to proxy_pass as a series of messages, as already happens with proxy_read_response.erl. I think I could make this happen by refactoring proxy_read_response into a module that completely owns the server socket through the entire connection life cycle. proxy_pass would still be used for processing filters on requests and responses, but then the new module would interpret headers and decide if it has an object in cache or if it needs to make an http request.
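To make that idea concrete, here is a rough sketch of the cache-or-fetch decision. The module name, the map-based cache, and the cache_hit/cache_miss return values are all made up for illustration; only the {response_header,...}/{response_data,...}/{end_response_data,...} message shapes echo what proxy_read_response already sends today.

```erlang
-module(cache_source).
-export([serve/3]).

%% serve/3 stands in for the refactored proxy_read_response: it either
%% replays a cached object to proxy_pass as the usual message series,
%% or signals that a server connection is needed.
serve(ProxyPass, Cache, Url) ->
    case maps:find(Url, Cache) of
        {ok, {Headers, Body}} ->
            ProxyPass ! {response_header, Headers},
            ProxyPass ! {response_data, Body},
            ProxyPass ! {end_response_data, byte_size(Body)},
            cache_hit;
        error ->
            %% real code would open the server socket here, relay the
            %% request, and store the response if headers permit
            cache_miss
    end.
```

Because the cached object is emitted through the same message series, proxy_pass wouldn't need to know whether the bytes came from cache or from a server socket.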

Now for the sanity check... How does this approach sound? How is your erlang? I could really use some help making this caching stuff work. :)

ljackson commented 12 years ago

Might also want to think about it in terms of a cache lookup into the cluster and a parallel response from whichever node has that cache on disk, a kind of scatter-gather: use the first response and drop the rest. This would allow sharding of the cache data across the erlang cluster members, and/or allow for dedicated cache boxes in the surrogate architecture. Thoughts?
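A minimal sketch of that scatter-gather, assuming each cluster member is represented by a query fun (a real version would rpc into the actual nodes instead):

```erlang
-module(cluster_cache).
-export([lookup/2]).

%% Queries is a list of funs, one per cluster node; each returns
%% {hit, Body} or miss. Fan the lookup out, take the first hit.
lookup(Queries, Key) ->
    Ref = make_ref(),
    Self = self(),
    [spawn(fun() -> Self ! {Ref, Q(Key)} end) || Q <- Queries],
    wait_first(Ref, length(Queries)).

%% First hit wins; stragglers are simply dropped. Give up once every
%% node has reported a miss (or after a timeout).
wait_first(_Ref, 0) ->
    miss;
wait_first(Ref, Pending) ->
    receive
        {Ref, {hit, Body}} -> {hit, Body};
        {Ref, miss} -> wait_first(Ref, Pending - 1)
    after 1000 ->
        miss
    end.
```

The unique reference tag keeps late replies from one lookup from being mistaken for answers to the next one.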

ghost commented 12 years ago

My erlang is nonexistent. But I've been coding for over 20 years: expert PHP, advanced Perl, C, Java, and some Python. So I'm not worried about picking up a new language. Even with the items you mentioned, I think it would be faster to learn Erlang via surrogate and add/test features than to try to work it into an existing system.

From your description, I think modifying proxy_pass is the better option from my point of view. To describe a normal workflow:

1. Surrogate gets a request for index.html
2. Based on certain rules, Surrogate checks for a cached copy. If there is a cached copy, skip the next step
3. Forward the request to the web server, get the response
4. If nothing was in the cache, and the headers don't preclude caching, cache the data
5. Direct the data through a stream filter for Edge Side Includes to be processed
6. Send the processed data back to the user
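The "headers don't preclude caching" check in that workflow could start out as simple as this sketch, which only looks at Cache-Control (a real implementation would also honor Expires, Vary, status codes, etc.; the module name is made up):

```erlang
-module(cacheability).
-export([may_cache/1]).

%% Headers is a proplist of lowercase binary names to binary values.
%% Refuse to cache anything marked no-store or private.
may_cache(Headers) ->
    case proplists:get_value(<<"cache-control">>, Headers) of
        undefined ->
            true;
        Value ->
            NoStore = binary:match(Value, <<"no-store">>) =/= nomatch,
            Private = binary:match(Value, <<"private">>) =/= nomatch,
            not (NoStore orelse Private)
    end.
```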

For a rather trivial example, most social sites have either "login" or some user-specific information in the upper right hand corner. This info is very bad for caching since every user has their own unique corner...so many CMSes resort to generating the entire page for what amounts to 3 links.

Instead, if the page is generated with an edge side include up there, then the reverse proxy server can pull JUST that small bit of data and merge it with cached data.
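The merging step could look roughly like this: scan the cached page for `<esi:include src="..."/>` tags and splice in freshly fetched fragments. The module name is invented and Fetch is a caller-supplied fun; in a real stream filter the fetch would be a subrequest through the proxy.

```erlang
-module(esi_sketch).
-export([expand/2]).

%% Replace each <esi:include src="..."/> in Html with the fragment
%% returned by Fetch(Src), working left to right.
expand(Html, Fetch) ->
    RE = <<"<esi:include src=\"([^\"]+)\"\\s*/>">>,
    case re:run(Html, RE, [{capture, all, index}]) of
        nomatch ->
            Html;
        {match, [{Start, Len}, {SrcStart, SrcLen}]} ->
            Src = binary:part(Html, SrcStart, SrcLen),
            Before = binary:part(Html, 0, Start),
            After = binary:part(Html, Start + Len,
                                byte_size(Html) - Start - Len),
            <<Before/binary, (Fetch(Src))/binary,
              (expand(After, Fetch))/binary>>
    end.
```

So the whole page body can sit in cache, and only the tiny per-user corner is fetched live on each request.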

I'll take a closer look next week at code and start looking into how to handle each of these for trivial cases...the complicated cases can be built up to.

ghost commented 12 years ago

What would help me most is if someone could list the erl modules by filename that are invoked (in their sequence) in order to handle a request. Then I can just hyperfocus on the relevant code and learn as I go.

skruger commented 12 years ago

The first point of interest is proxy_protocol_http.erl. This is the proxy_protocol handler module that starts the HTTP proxy specific work. It is responsible for initializing proxy_pass.erl. The initial state for proxy_pass is the proxy_start state which expects the client socket to be passed as an event. Take note that the client socket is handed off in this manner to avoid a race condition. The socket needs to change ownership from the proxy_listener process and become owned by the proxy_pass process before proxy_pass tries to read the socket.

proxy_pass sends itself a 'request' event which is handled by the client_send_11 state. In client_send_11(request,...) the proxy_read_request.erl process is started. This process is handed ownership of the client socket and is responsible for reading headers and data. proxy_read_request reads data from the socket and generates erlang messages for {request_header,...}, {request_data,...}, and {end_request_data,...}. These are received and processed by the handle_info() functions, which pass them through the proxy filters and send them back to proxy_pass as events which are handled by the client_send_11() functions.
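That read-filter-forward loop boils down to something like the following sketch. The module name and the fun-based filters are illustrative; only the three message shapes mirror what proxy_read_request actually generates.

```erlang
-module(request_relay).
-export([relay/2]).

%% Filters is a list of funs applied in order to each payload before
%% it is forwarded to the proxy_pass process.
relay(ProxyPass, Filters) ->
    receive
        {request_header, Header} ->
            ProxyPass ! {request_header, apply_filters(Filters, Header)},
            relay(ProxyPass, Filters);
        {request_data, Data} ->
            ProxyPass ! {request_data, apply_filters(Filters, Data)},
            relay(ProxyPass, Filters);
        {end_request_data, Size} ->
            ProxyPass ! {end_request_data, Size},
            done
    end.

apply_filters(Filters, Payload) ->
    lists:foldl(fun(F, Acc) -> F(Acc) end, Payload, Filters).
```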

It is in client_send_11({request_header,_,_},...) that the first changes would probably need to be made. That is where the decision to connect to the next hop host is made. It is also the place where the final header modifications are made before passing the request on to the next hop host.

Once the connection has been formed and the request is sent, control of the server side socket is passed to proxy_read_response.erl. proxy_read_response works almost exactly the same way as proxy_read_request.

If we modify proxy_read_response so that it owns the server socket the entire time we can send the request headers and body as events to the proxy_read_response process. It would then be possible for that process to do a cache lookup and return a cached document or make the decision to connect to a server to fetch the document itself.

An important feature of putting this into proxy_read_response (or some replacement that is derived from proxy_read_response) is that the cached response would still be subject to the proxy filters as it passes the response data back to proxy_pass. This would allow caching of pages that use edge side includes.

A second feature of putting caching into proxy_read_response and making that module's process completely own the server side socket is that we could manage pools of ready proxy_read_response instances, which would allow us to possibly do connection keep-alive on the server side of surrogate, not just the client side.
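A minimal checkout/checkin sketch of such a pool, assuming it is keyed by {Host, Port} and holds idle process pids (everything here is invented for illustration; real code would also health-check idle sockets and cap pool sizes):

```erlang
-module(conn_pool).
-export([checkout/2, checkin/3]).

%% Pool is a map from {Host, Port} to a list of idle pids.
%% checkout/2 reuses an idle instance when one exists, otherwise
%% tells the caller to spin up a fresh connection.
checkout(Pool, HostPort) ->
    case maps:get(HostPort, Pool, []) of
        [Pid | Rest] -> {reuse, Pid, Pool#{HostPort => Rest}};
        []           -> {connect, Pool}
    end.

%% checkin/3 returns a still-alive instance to the idle list after a
%% kept-alive response completes.
checkin(Pool, HostPort, Pid) ->
    Idle = maps:get(HostPort, Pool, []),
    Pool#{HostPort => [Pid | Idle]}.
```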

pib commented 11 years ago

Did anyone ever do any work on adding caching?

skruger commented 11 years ago

This weekend I decided to get back into the code and work on a few things with an eye toward caching and other interesting features. I've been working in the new "simplify" branch, which removes a number of clustering-related features that were not useful as implemented. Now that there is a clear point of separation between the client side and the server side, it should be much easier to evaluate whether an object is in cache and serve that instead of creating a server connection and requesting the object again.