Consider adding more docs on RethinkDB Proxy

mlucy commented 8 years ago

I feel like I've been asked about it a lot in the last month, and as far as I can tell we only talk about it as an aside in the changefeed docs. It might be worth adding a page that talks more about the subject (I'm not quite sure where it would go in our existing scheme).

We should probably mention:

What a proxy node is.
Why they exist.
What the performance characteristics are, especially for changefeeds.
How to start one and what the network config needs to look like (specifically the fact that it behaves like a normal RethinkDB node and thus other nodes in the cluster will try to connect to it).
- Once https://github.com/rethinkdb/rethinkdb/issues/5138 is resolved we should add docs on how to start a proxy node on system startup too.

@chipotle, any thoughts on whether this is worthwhile and where it should go?

chipotle commented 8 years ago

The main documentation is the (short) section called Running a proxy node in the "Scaling, sharding and replication" document; that's also linked under the Proxy nodes heading in "Optimizing query performance," from the end of the "Command line options" document, and from Scaling considerations in the main Changefeed documentation.

That section is relatively short, and if there's stuff that you think it's missing we can add it. (For instance, it doesn't address "what the network config needs to look like.") And if there are other places we need to link it, we can. Unless there's a lot missing I'm not sure whether this really needs to be in its own document, but if you don't think it belongs under "Scaling" maybe there's a case for a better location.

mlucy commented 8 years ago

OK, cool! I actually missed that while looking for docs somehow. I think once rethinkdb/rethinkdb#5138 is in we should update that section with information on how to automatically start a proxy node (and probably also add some information about the network configuration while we're in there). Assigning to myself in the meantime.

danielmewes commented 8 years ago

@mlucy This is currently assigned to you. Are you still planning to write something up for this or should we reassign?

danielmewes commented 8 years ago

I'm going to go ahead and reassign this to myself in order to get some details together. @mlucy Please complain if you have started writing anything up for this.

mlucy commented 8 years ago

@danielmewes -- I never did anything on this. We should probably fix https://github.com/rethinkdb/rethinkdb/issues/5138 while we're at this though.

danielmewes commented 8 years ago

@chipotle Here's a write up of some of the things that we should probably cover:

What is a RethinkDB proxy?

A RethinkDB proxy is a RethinkDB server that doesn't store any persistent state, but performs certain aspects of query processing locally.

A typical use case is running a RethinkDB proxy instance locally on each application server (see figure TODO). We'll take a closer look at typical proxy setups below.

TODO: Maybe insert a figure like the following one that compares a setup without and with proxies. Specifically it should illustrate the idea of having the proxies run on the application servers, with the application connecting directly to the proxy instead of to the individual database servers.

 | 1. Basic setup without proxies |
 ----------------------------------                                      
   ____________________                    ________________                                              
  |     App server 1   |                  |   DB server 1  |                                                  
  |  _______________   |            ----->|                |==                       
  | |  Application  |--|----------->|      ----------------  ||                         
  |  ---------------   |            |                        ||                     
  |                    |            |                        ||                    
   --------------------             |      ________________  ||                                       
                        Client conn.|     |   DB server 2  | || Cluster connections
   ____________________             |---->|                |=||                                                
  |     App server 2   |            |      ----------------  ||                                               
  |  _______________   |            |                        ||                          
  | |  Application  |--|----------->|      ________________  ||                 
  |  ---------------   |            |     |   DB server 3  | ||                     
  |                    |            ----->|                |==                     
   --------------------                    ----------------   

  | 2. Setup with proxies        |
  --------------------------------                                                    
   ____________________                    ________________                                              
  |     App server 1   |                  |   DB server 1  |                                                  
  |  _______________   |                  |                |==                       
  | |  Application  |  |                   ----------------  ||                         
  |  ---------------   |                                     ||                     
  |  _______|_______   |                                     ||                    
  | |     Proxy     |==|=====================================||                    
   --------------------                    ________________  ||                                       
                                          |   DB server 2  | || Cluster connections
   ____________________                   |                |=||                                                
  |     App server 2   |                   ----------------  ||                                               
  |  _______________   |                                     ||                          
  | |  Application  |  |                   ________________  ||                 
  |  ---------------   |                  |   DB server 3  | ||                     
  |  _______|_______   |                  |                |=||                    
  | |     Proxy     |==|=============      ----------------  ||     
   --------------------             \=========================

More precisely, a proxy:

Cannot become a replica for any tables
Does not participate in write or failover majorities (i.e. it cannot be used to enable auto-failover in smaller clusters)
Does not have a server name or any server tags
Does not show up in certain system tables, including the server_config table

However, a RethinkDB proxy still provides the following features:

It listens for client connections.
It can join a cluster and connects directly to all servers in the cluster, so it can efficiently route read and write operations directly to the responsible replicas.
It parses, interprets and processes queries. While most work will typically be passed on to the nodes that host the table data, certain operations will be performed on the proxy itself (for example non-indexed orderBy, operations on arrays etc.).
It manages changefeeds locally, allowing deduplication of change notifications within the cluster.

To start a RethinkDB proxy, you can run rethinkdb proxy -j <other server>. See TODO for more details on this command.

When should I use a RethinkDB proxy?

Primary use cases include:

Scaling changefeeds.
Reducing latencies within the cluster.
Improving throughput of queries where the bottleneck lies in certain aspects of the query processing, such as query parsing or expensive array-based operations.

We will see in the next section how proxy servers can achieve these objectives.

How can a proxy improve performance?

With a proxy running locally on each application server, you can avoid an additional network hop by taking advantage of the proxy's intelligent routing logic. The proxy always knows which database server holds the data for a particular request, and can route the request directly to the responsible server. Compare that to a client connecting directly to a database server. Since the application usually doesn't know which database server has the data needed for a particular query, the server handling the query will need to forward the request further to obtain the data. Two network hops will be required in this scenario, instead of just one with the local proxy.
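
To make the hop count concrete, here is a minimal sketch in Python. It is a toy model, not RethinkDB code: the server names, the key-to-server mapping and both `hops_*` functions are hypothetical, and it assumes that each key lives on exactly one server and that the app-to-proxy connection is local.

```python
import random

# Hypothetical toy cluster: which server holds the data for each key.
SHARD_OWNER = {"a": "db1", "b": "db2", "c": "db3"}
SERVERS = sorted(set(SHARD_OWNER.values()))


def hops_direct(key, connected_to):
    """Network hops when the app connects to one database server directly."""
    hops = 1  # app server -> the database server it is connected to
    if SHARD_OWNER[key] != connected_to:
        hops += 1  # that server has to forward the request to the owner
    return hops


def hops_via_local_proxy(key):
    """Network hops when the app talks to a proxy on the same machine."""
    # app -> proxy stays on the machine; proxy -> owning server is one hop.
    return 1


if __name__ == "__main__":
    rng = random.Random(0)
    keys = [rng.choice(list(SHARD_OWNER)) for _ in range(1000)]
    direct = sum(hops_direct(k, rng.choice(SERVERS)) for k in keys) / len(keys)
    proxied = sum(hops_via_local_proxy(k) for k in keys) / len(keys)
    print(f"average hops, direct connection: {direct:.2f}")
    print(f"average hops, local proxy:       {proxied:.2f}")
```

In this model a direct connection only gets away with one hop when the client happens to be connected to the server that owns the key, which becomes less likely as the cluster grows.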

Since proxies also perform certain query processing steps themselves, they can also help scale those queries more easily. Adding a proxy to a cluster is often easier than adding a full database server. The proxy will not only handle decoding the client requests and encoding the query responses, but will also perform a number of calculations locally. These specifically include many ReQL commands that operate on in-memory arrays, as well as commands that work on aggregated data (e.g. orderBy without an index, commands following an ungroup operation etc.).

Changefeeds

In addition to regular queries, a proxy can also be a very powerful tool for scaling changefeed-heavy applications.

A proxy will manage changefeeds locally, and reduce the overhead (RAM, CPU and network) on the database servers. For any write operation to a table with active changefeeds, a database server only sends a single network message to a given proxy. If for example 10,000 clients are listening through a proxy to changes on a particular table, the database server(s) hosting the table will send a single network message to the proxy when the table is changed. The proxy then takes care of forwarding the change message to all 10,000 clients locally.

This even works if the clients are listening to different selections on the table, such as when using the query r.table('test').getAll(val, {index: "idx"}).changes() with different values val for each changefeed. The proxy will receive one message from the database servers for every write to the table 'test', and will check locally which changefeeds are affected by this change.
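
The fan-out described above can be sketched roughly as follows. This is a simplified model rather than the actual implementation: `ToyProxy`, its methods and the flat `{"idx": ...}` change format are made up for illustration, and real change messages carry more than a single index value.

```python
from collections import defaultdict


class ToyProxy:
    """Hypothetical model of a proxy fanning out changes to local feeds."""

    def __init__(self):
        # index value -> list of callbacks, one per open changefeed
        self._feeds = defaultdict(list)
        self.messages_from_cluster = 0

    def subscribe(self, index_value, callback):
        """Register a changefeed on documents whose index equals index_value."""
        self._feeds[index_value].append(callback)

    def on_write(self, change):
        """Called once per write to the table, however many feeds are open."""
        self.messages_from_cluster += 1
        # The matching happens here, on the proxy, not on the database server.
        for callback in self._feeds.get(change["idx"], []):
            callback(change)


if __name__ == "__main__":
    proxy = ToyProxy()
    received = defaultdict(list)
    for client in range(10_000):
        # Each client listens for a different value, like getAll(val, ...).
        proxy.subscribe(client % 100, received[client].append)

    proxy.on_write({"idx": 7})
    notified = sum(1 for changes in received.values() if changes)
    print(f"messages from cluster: {proxy.messages_from_cluster}")
    print(f"clients notified:      {notified}")
```

The point of the model is that `messages_from_cluster` grows with the number of writes, not with the number of subscribed clients.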

How do I run it?

A proxy places fewer demands on the system's hardware than a regular database server. In particular, it doesn't require fast storage and it uses less RAM because it doesn't need to maintain a local data cache.

That being said, proxies still benefit from fast CPUs, and require enough RAM to process your application's queries. 256 MB of available RAM is often enough for simple queries and moderate query throughput. However, complex queries and/or a high number of concurrent operations (including a high number of open changefeeds) might require additional resources for the proxy server.

Note that in contrast to a regular client, which can connect to a single server, a proxy server must be able to connect directly to all database servers on their intra-cluster ports (port 29015 by default). If a proxy is unable to connect to all database servers, some tables might become inaccessible to the proxy and queries using those tables will fail.
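
One way to check this requirement before starting a proxy is to try a plain TCP connection from the proxy's machine to each database server. The sketch below uses only the Python standard library; the function name is an assumption, and a successful connection only shows that the port is reachable, not that a RethinkDB server is listening on it.

```python
import socket

CLUSTER_PORT = 29015  # default intra-cluster port mentioned above


def unreachable_servers(hosts, port=CLUSTER_PORT, timeout=2.0):
    """Return the hosts that do not accept a TCP connection on `port`."""
    failed = []
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connection opened and closed again; nothing is sent
        except OSError:
            # Covers refused connections, timeouts and failed name lookups.
            failed.append(host)
    return failed
```

Run it on the machine that will host the proxy, passing the addresses of all database servers in the cluster. An empty list means every server accepted a connection on the cluster port.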

If all requirements are met, you can run a proxy through the rethinkdb proxy command. See TODO for details.

TODO: Mention/Describe rethinkdb/rethinkdb#5138 once available

TODO: Maybe describe a few "best practice" scenarios in more detail. E.g. "proxy running on each app server" vs. "adding a central pool of proxies to a cluster".

williamstein commented 8 years ago

This is fantastic!! Things to maybe expand on:

Emphasize that you can add/remove proxies at any time without any impact on cluster stability (unlike normal nodes).
Say something about relationship with connection pools; for example, with SageMathCloud I have a (complicated?) connection pool system, where applications make lots of connections to the database, round robin queries using each, destroy connections that are slow, etc. I used this before using proxy nodes. I switched to proxy nodes and I'm still using this connection pool, but maybe it is completely not necessary?
Maybe say something about how potentially more data gets transferred over the network in some cases. Where it says "For any write operation to a table with active changefeeds,..." you give the optimal best case situation. But isn't there a worst case -- maybe the client is listening for a very specific change, and 99.99% of updates to the table don't trigger that; however, with a proxy node, all of those changes get sent to the proxy node. For my personal use case (everything on a local super fast free network inside Google Compute Engine), even this worst case is fine, since the network is so good. But it could matter for some multi-data center deployments.
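
A back-of-the-envelope version of that worst case, as a Python sketch. It takes the premise above at face value (every write to the table is sent to the proxy, whether or not any client's selection matches); the function name and the numbers are made up for illustration.

```python
def messages_to_proxy(writes, match_fraction):
    """Compare change messages sent to the proxy with those clients need."""
    sent_to_proxy = writes  # one message per write, matching or not
    needed_by_clients = round(writes * match_fraction)
    return sent_to_proxy, needed_by_clients


if __name__ == "__main__":
    # 99.99% of updates do not match the client's selection.
    sent, needed = messages_to_proxy(writes=1_000_000, match_fraction=0.0001)
    print(f"sent to proxy:     {sent}")
    print(f"needed by clients: {needed}")
```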

danielmewes commented 8 years ago

Thanks for the feedback @williamstein . Very useful. We'll try to incorporate that.

I think connection pools are still useful, because the proxy is still going to be able to utilize multiple cores better with multiple client connections.

hamiltop commented 8 years ago

What about multiple client connections enables better cpu usage in a proxy?

danielmewes commented 8 years ago

@hamiltop Yeah that's what I meant. :-)

hamiltop commented 8 years ago

@danielmewes Sorry, I was asking a question. Why does that enable better cpu usage? What aspect of multiple client connections leads to more cpu usage?

danielmewes commented 8 years ago

Ah, sorry. Each incoming client connection is assigned to one CPU core randomly (or actually round robin I think). A lot of work for any query run through that connection is going to happen on that core.

So by using multiple connections and spreading queries across them, you can better utilize multiple CPU cores on the proxy.
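
A minimal sketch of that idea in Python, assuming nothing about the driver: `RoundRobinPool` is a hypothetical helper, and `connect` stands in for whatever call your driver uses to open a connection to the proxy. It is not thread-safe and does no health checking.

```python
import itertools


class RoundRobinPool:
    """Hand out connections in turn so queries spread across all of them."""

    def __init__(self, connect, size):
        # `connect` is any zero-argument callable that opens one connection.
        if size < 1:
            raise ValueError("size must be at least 1")
        self._connections = [connect() for _ in range(size)]
        self._cycle = itertools.cycle(self._connections)

    def next_connection(self):
        """Return the next connection in round-robin order."""
        return next(self._cycle)


if __name__ == "__main__":
    # Stand-in for a real driver call; each "connection" is just a label.
    counter = itertools.count(1)
    pool = RoundRobinPool(lambda: f"conn-{next(counter)}", size=4)
    for _ in range(6):
        print(pool.next_connection())
```

Running each query on `pool.next_connection()` spreads queries over several connections, and therefore, per the explanation above, over several cores on the proxy.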

hamiltop commented 8 years ago

Interesting. Is that true for normal cluster connections? (not just proxies)

danielmewes commented 8 years ago

This is true for normal servers as well.

However, on a normal server there are more tasks that do not depend on the core handling the connection and that use their own model for distributing work across threads. So it's a little bit less relevant there, depending on the workload.

danielmewes commented 8 years ago

I think most of the information is here. Handing over to @chipotle .

brucepom commented 8 years ago

It would also be great to see some recommendations on starting the proxy and restarting it if the process dies or if the server is rebooted. Looking at https://github.com/rethinkdb/rethinkdb/issues/5138 there's a suggestion that this can be done by editing the init script. Not being an expert on this, I wasn't confident enough to jump into /etc/init.d/rethinkdb and start messing around. I ended up using Upstart instead of altering the init script. I wrote up some notes as I couldn't find an explanation anywhere of how to do this. I'd welcome feedback on the approach; I'm certainly not experienced with this.

danielmewes commented 8 years ago

Thanks for sharing your notes on this @brucepom ( https://medium.com/@brucepomeroy/running-a-rethinkdb-proxy-on-ubuntu-68f8cd308b7b ).

While we should fix this more generally in the mid-term (rethinkdb/rethinkdb#5138), it might be nice to mention how to add the upstart script for the meantime in our docs. @chipotle do you think that's something we could incorporate?

suru1432002 commented 7 years ago

I set up a RethinkDB proxy on a separate node rather than running the proxy on the app server itself. So my app contacts the proxy node, which in turn fetches the data from the RethinkDB cluster.

Is there any way to figure out whether the query processing is actually happening on the proxy machine?

From the netstat command, I see that, apart from the cluster nodes, my proxy node is connected to an unknown IP on port 28015 (I didn't use or configure this IP anywhere in the network).

thomasmodeneis commented 7 years ago

Hi, I wrote a little post about Running a RethinkDB Proxy as Daemon that could be helpful to someone ...

bbar commented 6 years ago

Here's my attempt to start RethinkDB as a proxy node using systemd in Ubuntu 16.04. Feel free to add to it...

1. Install RethinkDB as usual, as outlined in the documentation. (Don't worry about copying the sample configuration file mentioned here.)

2. Create a systemd unit file

$ vim /lib/systemd/system/rethinkdb-proxy-node.service

Add the following to the file


[Unit]
Description=RethinkDB proxy node

[Service]
User=rethinkdb
Group=rethinkdb
ExecStart=/usr/bin/rethinkdb proxy --join :29015 --log-file /var/log/rethinkdb/rethinkdb.log --initial-password auto
KillMode=process
PrivateTmp=true

[Install]
WantedBy=multi-user.target


3. Create log dir + manage permissions

$ sudo mkdir /var/log/rethinkdb
$ sudo chown rethinkdb:rethinkdb /var/log/rethinkdb


4. Enable run on startup

$ sudo systemctl enable rethinkdb-proxy-node.service


5. Start the service

$ sudo systemctl start rethinkdb-proxy-node.service


**Other useful things...**

Check the status

$ sudo systemctl status rethinkdb-proxy-node.service


Tail the log

$ tail -f /var/log/rethinkdb/rethinkdb.log

atris commented 6 years ago

@bbar Could you have a pull request for this?

bbar commented 6 years ago

@atris sure. In this file, right?

atris commented 6 years ago

Yes and yes -- Regards,

Atri l'apprenant

rethinkdb / docs

Consider adding more docs on RethinkDB Proxy #962