skroutz / rafka

Kafka proxy with a simple API, speaking the Redis protocol
https://engineering.skroutz.gr/blog/kafka-rails-integration/
GNU General Public License v3.0

Add support for the MONITOR command #80

Closed: dblia closed this 5 years ago

dblia commented 5 years ago

MONITOR is a debugging command that streams back every command processed by the Redis server. Since Rafka does not log the commands it handles, adding the ability to "attach a terminal to the server" and listen to all requests the server receives in real time would be a quite useful addition, making it easier to spot bugs in a Rafka application.

To enable this functionality, we add an extra channel to the `Client` struct, the `monitorChan`. A separate goroutine, spawned at the client's creation time, streams any data written to that channel to the client's socket. On the server side, we maintain a sync map of the clients' monitoring channels, named `monitorChans`. Each time a MONITOR command is issued, the respective client's channel is added to that map. Another goroutine, named `monitorHandler`, is in charge of writing every command received by the Rafka server to the respective clients' channels via the main `srvMonitorChan` channel.

The format of the monitor output mimics the Redis equivalent, and it's the following:

```
<timestamp> [0 <client-address>] "<command>" "<arg1>" ... "<argN>"
```

Finally, some cleanup actions are applied to the project's README file **but** without affecting any of the existing documentation. The most important of those are the following:

- drop all unnecessary empty lines from the file
- strip unnecessary leading spaces
- wrap all lines to 100 chars length (there is currently no consistency)
- add missing newlines between titles and paragraphs

---

Apparently, this feature should be used with caution, since attaching monitoring shells to a Rafka server comes with a cost. To illustrate that cost, I've run a (quite naive) testing scenario which measures the consuming throughput of _100k_ messages from a single Rafka consumer on a host with _4 CPUs_ (_Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz_) and _16GB_ of memory. Note that the hardware specs of the host on which the testsuite was run are not that important, since the main rationale is to illustrate how multiple _MONITORing_ shells may affect the application's performance and **not** to mimic a production-like setup.

Below, I present the results of the aforementioned scenario. It's quite clear how the throughput rate drops as more monitoring shells are attached to the Rafka server:

| Active monitors | Consuming time (100k msgs) | Throughput    |
|-----------------|----------------------------|---------------|
| -               | 23.1029sec                 | ~4330msgs/sec |
| 1               | 28.2468sec                 | ~3541msgs/sec |
| 5               | 35.7541sec                 | ~2797msgs/sec |

> For reference, it took around _22.3717sec_ for a Rafka _v0.4.0_ consumer to consume those messages.

That decrease in the consumer's throughput as the monitoring clients increase is mainly explained by the increased number of _Write_ calls that the monitoring clients make in order to flush the running commands to their sockets. There is also an overhead for scheduling the extra goroutines, but it looks negligible compared to the overhead of the _Write_ calls.
This can also be confirmed by listing the _pprof_ results of the _15 most CPU-consuming entities_ after running the aforementioned scenario with _zero_ and _5_ active monitoring clients attached to the Rafka server:

```shell
# Active monitors: 0
(pprof) top15 -cum
Showing nodes accounting for 3s, 40.00% of 7.50s total
Dropped 166 nodes (cum <= 0.04s)
Showing top 15 nodes out of 139
      flat  flat%   sum%        cum   cum%
     0.05s  0.67%  0.67%      4.65s 62.00%  main.(*Server).Handle
         0     0%  0.67%      4.65s 62.00%  main.(*Server).ListenAndServe.func3
     0.01s  0.13%   0.8%      1.96s 26.13%  runtime.mcall
         0     0%   0.8%      1.94s 25.87%  runtime.park_m
     0.03s   0.4%  1.20%      1.92s 25.60%  runtime.schedule
     1.48s 19.73% 20.93%      1.60s 21.33%  syscall.Syscall
     0.07s  0.93% 21.87%      1.49s 19.87%  runtime.findrunnable
     1.30s 17.33% 39.20%      1.30s 17.33%  runtime.futex
         0     0% 39.20%      1.12s 14.93%  bufio.(*Writer).Flush
         0     0% 39.20%      1.12s 14.93%  github.com/skroutz/rafka/vendor/github.com/secmask/go-redisproto.(*Writer).Flush
     0.01s  0.13% 39.33%      1.12s 14.93%  net.(*conn).Write
     0.01s  0.13% 39.47%      1.11s 14.80%  net.(*netFD).Write
     0.01s  0.13% 39.60%      1.10s 14.67%  internal/poll.(*FD).Write
     0.01s  0.13% 39.73%      1.07s 14.27%  syscall.Write
     0.02s  0.27% 40.00%      1.06s 14.13%  syscall.write
```

```shell
# Active monitors: 5
(pprof) top15 -cum
Showing nodes accounting for 7.87s, 38.19% of 20.61s total
Dropped 277 nodes (cum <= 0.10s)
Showing top 15 nodes out of 127
      flat  flat%   sum%        cum   cum%
     0.06s  0.29%  0.29%      7.58s 36.78%  main.(*Server).Handle
         0     0%  0.29%      7.58s 36.78%  main.(*Server).ListenAndServe.func3
     7.08s 34.35% 34.64%      7.45s 36.15%  syscall.Syscall
     0.02s 0.097% 34.74%      7.28s 35.32%  github.com/skroutz/rafka/vendor/github.com/secmask/go-redisproto.(*Writer).Flush
     0.08s  0.39% 35.13%      7.26s 35.23%  bufio.(*Writer).Flush
     0.02s 0.097% 35.23%      7.16s 34.74%  net.(*conn).Write
     0.09s  0.44% 35.66%      7.14s 34.64%  net.(*netFD).Write
     0.04s  0.19% 35.86%      7.05s 34.21%  internal/poll.(*FD).Write
     0.01s 0.049% 35.90%      6.74s 32.70%  syscall.Write
     0.04s  0.19% 36.10%      6.73s 32.65%  syscall.write
     0.06s  0.29% 36.39%      6.22s 30.18%  main.(*Client).monitorWriter
     0.06s  0.29% 36.68%      3.69s 17.90%  runtime.mcall
     0.01s 0.049% 36.73%      3.62s 17.56%  runtime.park_m
     0.10s  0.49% 37.21%      3.54s 17.18%  runtime.schedule
     0.20s  0.97% 38.19%      2.72s 13.20%  runtime.findrunnable
```