Open mlucy opened 8 years ago
The main documentation is the (short) section called Running a proxy node in the "Scaling, sharding and replication" document; that's also linked under the Proxy nodes heading in "Optimizing query performance," from the end of the "Command line options" document, and from Scaling considerations in the main Changefeed documentation.
That section is relatively short, and if there's stuff that you think it's missing we can add it. (For instance, it doesn't address "what the network config needs to look like.") And if there are other places we need to link it, we can. Unless there's a lot missing I'm not sure whether this really needs to be in its own document, but if you don't think it belongs under "Scaling" maybe there's a case for a better location.
OK, cool! I actually missed that while looking for docs somehow. I think once rethinkdb/rethinkdb#5138 is in we should update that section with information on how to automatically start a proxy node (and probably also add some information about the network configuration while we're in there). Assigning to myself in the meantime.
@mlucy This is currently assigned to you. Are you still planning to write something up for this or should we reassign?
I'm going to go ahead and reassign this to myself in order to get some details together. @mlucy Please complain if you have started writing anything up for this.
@danielmewes -- I never did anything on this. We should probably fix https://github.com/rethinkdb/rethinkdb/issues/5138 while we're at this though.
@chipotle Here's a write up of some of the things that we should probably cover:
What is a RethinkDB proxy?
A RethinkDB proxy is a RethinkDB server that doesn't store any persistent state, but performs certain aspects of query processing locally.
A typical use case is running a RethinkDB proxy instance locally on each application server (see figure TODO). We'll take a closer look at typical proxy setups below.
TODO: Maybe insert a figure like the following one that compares a setup without and with proxies. Specifically it should illustrate the idea of having the proxies run on the application servers, with the application connecting directly to the proxy instead of to the individual database servers.
| 1. Basic setup without proxies |
----------------------------------
 ____________________                    ________________
|    App server 1    |                  |  DB server 1   |
|  _______________   |            ----->|                |==
| |  Application  |--|----------->|      ----------------  ||
|  ---------------   |            |                        ||
|                    |            |                        ||
 --------------------             |      ________________  ||
                      Client conn.|     |  DB server 2   | || Cluster connections
 ____________________             |---->|                |=||
|    App server 2    |            |      ----------------  ||
|  _______________   |            |                        ||
| |  Application  |--|----------->|      ________________  ||
|  ---------------   |            |     |  DB server 3   | ||
|                    |            ----->|                |==
 --------------------                    ----------------
| 2. Setup with proxies |
--------------------------------
 ____________________                    ________________
|    App server 1    |                  |  DB server 1   |
|  _______________   |                  |                |==
| |  Application  |  |                   ----------------  ||
|  ---------------   |                                     ||
|  _______|_______   |                                     ||
| |     Proxy     |==|=====================================||
 --------------------                    ________________  ||
                                        |  DB server 2   | || Cluster connections
 ____________________                   |                |=||
|    App server 2    |                   ----------------  ||
|  _______________   |                                     ||
| |  Application  |  |                   ________________  ||
|  ---------------   |                  |  DB server 3   | ||
|  _______|_______   |                  |                |=||
| |     Proxy     |==|=============      ----------------  ||
 --------------------              \=========================
More precisely, a proxy:
server_config table
However, a RethinkDB proxy still provides the following features:
orderBy, operations on arrays etc.).
To start a RethinkDB proxy, you can run rethinkdb proxy -j <other server>. See TODO for more details on this command.
When should I use a RethinkDB proxy?
Primary use cases include:
We will see in the next section how proxy servers can achieve these objectives.
How can a proxy improve performance?
With a proxy running locally on each application server, you can avoid an additional network hop by taking advantage of the proxy's intelligent routing logic. The proxy always knows which database server holds the data for a particular request, and can route the request directly to the responsible server. Compare that to a client connecting directly to a database server: since the application usually doesn't know which database server has the data needed for a particular query, the server handling the query will need to forward the request to another server to obtain the data. Two network hops are required in this scenario, instead of just one with the local proxy.
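To make the hop counting concrete, here is a rough JavaScript sketch. Everything in it is a toy assumption: the server names, the hash-based `primaryFor` lookup (a real cluster decides this from the table's shard configuration), and the helper names. It only counts network hops for the two setups described above.

```javascript
// Hypothetical cluster of three database servers.
const servers = ['db1', 'db2', 'db3'];

// Toy lookup of the server that holds the data for a key.
// Assumption: a simple string hash stands in for the real shard layout.
function primaryFor(key) {
  let hash = 0;
  for (const ch of String(key)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return servers[hash % servers.length];
}

// The application connects directly to one fixed database server.
function hopsDirect(key, connectedServer) {
  // Hop 1: application server -> connected database server.
  // Hop 2 (only if needed): connected server -> server holding the data.
  return primaryFor(key) === connectedServer ? 1 : 2;
}

// The application talks to a proxy on the same machine (no network hop),
// and the proxy contacts the responsible server directly.
function hopsViaLocalProxy(key) {
  return 1;
}

// Compare total hops for a handful of keys.
const keys = ['0', '1', '2', '3', '4', '5', '6', '7', '8'];
const directTotal = keys.reduce((sum, k) => sum + hopsDirect(k, 'db1'), 0);
const proxyTotal = keys.reduce((sum, k) => sum + hopsViaLocalProxy(k), 0);
console.log({ directTotal, proxyTotal });
```

The more database servers there are, the less likely it is that a directly connected server happens to hold the requested data, so the gap tends to grow with cluster size.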
Since proxies also perform certain query processing steps themselves, they can also help scale query processing more easily. Adding a proxy to a cluster is often easier than adding a full database server. The proxy will not only handle decoding the client requests and encoding the query responses, but will also perform a number of calculations locally. These specifically include many ReQL commands that operate on in-memory arrays, as well as commands that work on aggregated data (e.g. orderBy without an index, commands following an ungroup operation etc.).
Changefeeds
In addition to regular queries, a proxy can also be a very powerful tool for scaling changefeed-heavy applications.
A proxy will manage changefeeds locally, and reduce the overhead (RAM, CPU and network) on the database servers. For any write operation to a table with active changefeeds, a database server only sends a single network message to a given proxy. If for example 10,000 clients are listening through a proxy to changes on a particular table, the database server(s) hosting the table will send a single network message to the proxy when the table is changed. The proxy then takes care of forwarding the change message to all 10,000 clients locally.
This even works if the clients are listening to different selections on the table, such as when using the query r.table('test').getAll(val, {index: "idx"}).changes() with different values val for each changefeed. The proxy will receive one message from the database servers for every write to the table 'test', and will check locally which changefeeds are affected by this change.
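The fan-out described here can be sketched in a few lines of JavaScript. This is only a toy model, not how the server is implemented: `ToyProxy` and its methods are made-up names, and it assumes that the index "idx" is a simple index on a field called `idx`. The change objects use the `old_val` / `new_val` shape that changefeeds deliver.

```javascript
// Toy model of a proxy that fans out table changes to local changefeeds.
class ToyProxy {
  constructor() {
    this.feeds = [];          // locally managed changefeed subscriptions
    this.messagesFromDb = 0;  // messages received from database servers
  }

  // Stand-in for r.table('test').getAll(val, {index: 'idx'}).changes()
  subscribe(val, onChange) {
    this.feeds.push({ val, onChange });
  }

  // Called once per write to the table, however many feeds are open.
  receiveWrite(change) {
    this.messagesFromDb += 1;
    let delivered = 0;
    for (const feed of this.feeds) {
      const oldMatch = change.old_val !== null && change.old_val.idx === feed.val;
      const newMatch = change.new_val !== null && change.new_val.idx === feed.val;
      if (oldMatch || newMatch) {
        feed.onChange(change);
        delivered += 1;
      }
    }
    return delivered;
  }
}

// 10,000 clients, each listening for one of 100 different values.
const proxy = new ToyProxy();
let received = 0;
for (let i = 0; i < 10000; i++) {
  proxy.subscribe(i % 100, () => { received += 1; });
}

// One write arrives from the database servers as a single message.
const delivered = proxy.receiveWrite({ old_val: null, new_val: { id: 1, idx: 7 } });
console.log(proxy.messagesFromDb, delivered); // 1 100
```

The database servers only ever see one subscriber (the proxy); the matching work against the 10,000 selections happens on the proxy's CPU.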
How do I run it?
A proxy has lower hardware requirements than a regular database server. In particular, it doesn't require fast storage, and it uses less RAM because it doesn't need to maintain a local data cache.
That being said, proxies still benefit from fast CPUs, and require enough RAM to process your application's queries. 256 MB of available RAM is often enough for simple queries and moderate query throughput. However, complex queries and/or a high number of concurrent operations (including a high number of open changefeeds) might require additional resources for the proxy server.
Note that, in contrast to a regular client, which can connect to a single server, a proxy server must be able to connect directly to all database servers on their intra-cluster ports (port 29015 by default). If a proxy is unable to connect to all database servers, some tables might become inaccessible to the proxy, and queries using those tables will fail.
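One way to check this requirement from the machine that will run the proxy is a plain TCP reachability test. The following Node.js sketch uses only the standard `net` module; the function names and the example host names are made up for illustration, and a successful TCP connect only shows that the port is reachable, not that the cluster will accept the proxy.

```javascript
const net = require('net');

// Resolve to true if a TCP connection to host:port opens within timeoutMs.
function canReach(host, port, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Check every database server on the intra-cluster port (29015 by default)
// and return the hosts that could not be reached.
async function unreachableServers(hosts, port = 29015) {
  const results = await Promise.all(hosts.map((host) => canReach(host, port)));
  return hosts.filter((_, i) => !results[i]);
}

// Example with hypothetical host names:
// unreachableServers(['db1.example', 'db2.example', 'db3.example'])
//   .then((missing) => console.log('unreachable:', missing));
```

An empty result means every listed server accepted a TCP connection on the given port from this machine.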
If all requirements are met, you can run a proxy through the rethinkdb proxy command. See TODO for details.
TODO: Mention/Describe rethinkdb/rethinkdb#5138 once available
TODO: Maybe describe a few "best practice" scenarios in more detail. E.g. "proxy running on each app server" vs. "adding a central pool of proxies to a cluster".
This is fantastic!! Things to maybe expand on:
Thanks for the feedback @williamstein . Very useful. We'll try to incorporate that.
I think connection pools are still useful, because the proxy is still going to be able to utilize multiple cores better with multiple client connections.
What about multiple client connections enables better cpu usage in a proxy?
@hamiltop Yeah that's what I meant. :-)
@danielmewes Sorry, I was asking a question. Why does that enable better cpu usage? What aspect of multiple client connections leads to more cpu usage?
Ah, sorry. Each incoming client connection is assigned to one CPU core randomly (or actually round robin I think). A lot of work for any query run through that connection is going to happen on that core.
So by using multiple connections and spreading queries across them, you can better utilize multiple CPU cores on the proxy.
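As a rough illustration of spreading queries across connections, here is a minimal round-robin pool in JavaScript. `RoundRobinPool` is a made-up name and the strings are stand-ins; with the JavaScript driver the pool members would be connections opened with `r.connect` against the local proxy, and each query would be run with `pool.acquire()` as its connection.

```javascript
// Minimal round-robin pool: hands out connections in rotation so that
// queries are spread evenly across all of them.
class RoundRobinPool {
  constructor(connections) {
    if (connections.length === 0) {
      throw new Error('pool needs at least one connection');
    }
    this.connections = connections;
    this.next = 0;
  }

  // Return the next connection in rotation.
  acquire() {
    const conn = this.connections[this.next];
    this.next = (this.next + 1) % this.connections.length;
    return conn;
  }
}

// Stand-in connection objects; real ones would come from the driver.
const pool = new RoundRobinPool(['conn0', 'conn1', 'conn2', 'conn3']);
const picks = Array.from({ length: 6 }, () => pool.acquire());
console.log(picks.join(' ')); // conn0 conn1 conn2 conn3 conn0 conn1
```

A pool size around the number of CPU cores on the proxy machine seems like a reasonable starting point, though that is a guess rather than a measured recommendation.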
Interesting. Is that true for normal cluster connections? (not just proxies)
This is true for normal servers as well.
However, on a normal server there are more tasks that do not depend on the core handling the connection and that use their own model for distributing work across threads. So it's a little less relevant there, depending on the workload.
I think most of the information is here. Handing over to @chipotle .
It would also be great to see some recommendations on starting the proxy and restarting it if the process dies or if the server is rebooted. Looking at https://github.com/rethinkdb/rethinkdb/issues/5138 there's a suggestion that this can be done by editing the init script. Not being an expert on this, I wasn't confident enough to jump into /etc/init.d/rethinkdb and start messing around. I ended up using Upstart instead of altering the init script. I wrote up some notes, as I couldn't find an explanation anywhere of how to do this. I'd welcome feedback on the approach, as I'm certainly not experienced with this.
Thanks for sharing your notes on this @brucepom ( https://medium.com/@brucepomeroy/running-a-rethinkdb-proxy-on-ubuntu-68f8cd308b7b ).
While we should fix this more generally in the mid-term (rethinkdb/rethinkdb#5138), it might be nice to mention how to add the upstart script for the meantime in our docs. @chipotle do you think that's something we could incorporate?
I set up a RethinkDB proxy on a separate node rather than running the proxy on the app server itself. So my app contacts the proxy node, which in turn fetches the data from the RethinkDB cluster.
Is there any way to figure out whether the query processing is actually happening on the proxy machine?
From the netstat command I see that, apart from the cluster nodes, my proxy node is connected to some unknown IP on port 28015. (I haven't used or configured this IP anywhere in the network.)
Hi, I wrote a little post about Running a RethinkDB Proxy as Daemon that could be helpful to someone ...
Here's my attempt to start RethinkDB as a proxy node using systemd in Ubuntu 16.04. Feel free to add to it...
1. Install RethinkDB as usual, as outlined in the documentation. (Don't worry about copying the sample configuration file mentioned here.)
2. Create a systemd unit file
$ vim /lib/systemd/system/rethinkdb-proxy-node.service
Add the following to the file
[Unit]
Description=RethinkDB proxy node
[Service]
User=rethinkdb
Group=rethinkdb
ExecStart=/usr/bin/rethinkdb proxy --join
[Install]
WantedBy=multi-user.target
3. Create log dir + manage permissions
$ sudo mkdir /var/log/rethinkdb
$ sudo chown rethinkdb:rethinkdb /var/log/rethinkdb
4. Enable run on startup
$ sudo systemctl enable rethinkdb-proxy-node.service
5. Start the service
$ sudo systemctl start rethinkdb-proxy-node.service
**Other useful things...**
Check the status
$ sudo systemctl status rethinkdb-proxy-node.service
Tail the log
$ tail -f /var/log/rethinkdb/rethinkdb.log
@bbar Could you open a pull request for this?
Yes and yes -- Regards,
Atri l'apprenant
I feel like I've been asked about it a lot in the last month, and as far as I can tell we only talk about it as an aside in the changefeed docs. It might be worth adding a page that talks more about the subject (I'm not quite sure where it would go in our existing scheme).
We should probably mention:
@chipotle, any thoughts on whether this is worthwhile and where it should go?