Open ghost opened 8 years ago
I've done further testing in response to your assertions.
First, I added a binary mode to the websocketpp, uwebsocket, and Go (golang.org/x/net/websocket) servers and updated the benchmark client to support it. Performance of all servers increased roughly in proportion to the message size reduction (the same message encoded in binary is about 75% the size of the message encoded in JSON) but not more. As you mention, a single broadcast from the server will cause thousands of JSON decodes in the client, but this test indicates that, because multiple client machines run in parallel, client-side decoding was not a limiting factor.
Next, I tested using C++ instead of Go for the benchmark client. In addition, the C++ benchmark does not use a full websocket implementation; it works directly with libuv. I used your throughput benchmark as the starting point and updated it to run the same test as the Go tool's binary broadcast test. It was indeed able to get higher results than a single Go client, but running the Go tool on multiple machines in parallel produces higher results still.
None of the above tests found the dramatic differences your assertions would lead me to expect. However, I think I found something that explains the substantially different results we observe. I noticed the benchmarks for uwebsockets are hard-coded to connect to 127.0.0.1. This could confound the results in two ways. First, the client and server are running on the same machine. So any resources taken by the benchmark client have a direct negative effect on the server. This explains getting a substantially different result from a very low overhead C++ client and a more heavyweight Go client. Second, by using a loopback interface instead of an actual network there is far less overhead. This allows seeing much higher numbers than is possible when actually on a real network.
The point of a good benchmark is to maximize the result difference between the test subjects.
I do not see the fact that most implementations are within 50% of each other as a flaw. I see it as a valid data point: for this particular workload, the choice of language and library should probably not be decided on throughput alone. For other workloads, results may be substantially different.
The raw results are here: https://github.com/hashrocket/websocket-shootout/blob/master/results/round-02-binary.md. The C++ benchmark is here: https://github.com/hashrocket/websocket-shootout/tree/master/cpp/bench.
To validate Chapter 6 of my first post, and to really show you how flawed your "benchmark of websocket libraries" is, I made my own server with uWS and it performs several hundred percent better than the one you wrote (using the very same uWS):
clients: 1000 95per-rtt: 7ms min-rtt: 4ms median-rtt: 7ms max-rtt: 7ms
clients: 2000 95per-rtt: 15ms min-rtt: 8ms median-rtt: 11ms max-rtt: 18ms
clients: 3000 95per-rtt: 19ms min-rtt: 12ms median-rtt: 14ms max-rtt: 25ms
clients: 4000 95per-rtt: 22ms min-rtt: 16ms median-rtt: 19ms max-rtt: 27ms
clients: 5000 95per-rtt: 31ms min-rtt: 20ms median-rtt: 23ms max-rtt: 36ms
clients: 6000 95per-rtt: 37ms min-rtt: 23ms median-rtt: 27ms max-rtt: 39ms
clients: 7000 95per-rtt: 36ms min-rtt: 26ms median-rtt: 29ms max-rtt: 40ms
clients: 8000 95per-rtt: 41ms min-rtt: 30ms median-rtt: 33ms max-rtt: 45ms
clients: 9000 95per-rtt: 44ms min-rtt: 34ms median-rtt: 37ms max-rtt: 49ms
clients: 10000 95per-rtt: 50ms min-rtt: 38ms median-rtt: 42ms max-rtt: 50ms
clients: 11000 95per-rtt: 54ms min-rtt: 42ms median-rtt: 45ms max-rtt: 59ms
clients: 12000 95per-rtt: 59ms min-rtt: 46ms median-rtt: 49ms max-rtt: 61ms
clients: 13000 95per-rtt: 63ms min-rtt: 50ms median-rtt: 53ms max-rtt: 64ms
clients: 14000 95per-rtt: 65ms min-rtt: 55ms median-rtt: 57ms max-rtt: 68ms
clients: 15000 95per-rtt: 73ms min-rtt: 58ms median-rtt: 61ms max-rtt: 75ms
clients: 16000 95per-rtt: 78ms min-rtt: 62ms median-rtt: 65ms max-rtt: 83ms
clients: 17000 95per-rtt: 89ms min-rtt: 66ms median-rtt: 69ms max-rtt: 145ms
clients: 18000 95per-rtt: 91ms min-rtt: 69ms median-rtt: 73ms max-rtt: 95ms
clients: 19000 95per-rtt: 90ms min-rtt: 73ms median-rtt: 77ms max-rtt: 93ms
clients: 20000 95per-rtt: 94ms min-rtt: 77ms median-rtt: 80ms max-rtt: 95ms
clients: 21000 95per-rtt: 98ms min-rtt: 81ms median-rtt: 86ms max-rtt: 103ms
clients: 22000 95per-rtt: 101ms min-rtt: 86ms median-rtt: 89ms max-rtt: 103ms
clients: 23000 95per-rtt: 105ms min-rtt: 89ms median-rtt: 93ms max-rtt: 105ms
clients: 24000 95per-rtt: 105ms min-rtt: 94ms median-rtt: 97ms max-rtt: 109ms
clients: 25000 95per-rtt: 130ms min-rtt: 97ms median-rtt: 103ms max-rtt: 202ms
clients: 26000 95per-rtt: 115ms min-rtt: 102ms median-rtt: 106ms max-rtt: 116ms
clients: 27000 95per-rtt: 123ms min-rtt: 104ms median-rtt: 112ms max-rtt: 125ms
clients: 28000 95per-rtt: 131ms min-rtt: 110ms median-rtt: 115ms max-rtt: 134ms
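For readers unfamiliar with the four columns above, here is a minimal sketch of how such a summary could be computed from raw round-trip samples. It assumes a nearest-rank percentile definition; the benchmark tool's actual method is not shown in this thread, and `RttSummary`, `nearestRank`, and `summarize` are names invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Summary of one benchmark step, in milliseconds (hypothetical type).
struct RttSummary {
    int min;
    int median;
    int p95;
    int max;
};

// Nearest-rank percentile over an already sorted sample set.
// Integer arithmetic avoids floating-point rounding at the rank boundary.
int nearestRank(const std::vector<int> &sorted, int pct) {
    assert(!sorted.empty() && pct > 0 && pct <= 100);
    std::size_t rank = (static_cast<std::size_t>(pct) * sorted.size() + 99) / 100;
    return sorted[rank - 1];
}

// Sort a copy of the samples and pick out the four reported values.
RttSummary summarize(std::vector<int> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return {samples.front(), nearestRank(samples, 50),
            nearestRank(samples, 95), samples.back()};
}
```

Calling `summarize({7, 4, 7, 6, 5})` on five samples yields min 4, median 6, 95th percentile 7, and max 7 under this definition.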
Just like Chapter 6 states, a broadcast is ultimately going to end up being a loop of syscalls (which is a constant workload for all servers). That's why it is important to know what you are doing when implementing things like pub/sub and similar things (like this very benchmark of yours). You cannot use your grandmother as a test subject when testing how fast a sports car is and then conclude, based on the fact that your grandmother didn't go any faster, that "all cars are the same speed". What you benchmark in that case is your grandmother, not the car.
By implementing a very simple server based on my own recommendations from this repo: https://github.com/alexhultman/High-performance-pub-sub I was able to give you results of your own benchmark, close to 5x different than those you came up with.
You need to stop tainting the benchmark with your own shortcomings. You cannot conclude that uWS is "about the same" as other low-perf implementations when the issue is what you put on top of the library. A server will not magically be fast just because you swapped to uWS; it requires that you know how to use it and the surrounding low-level matters.
Stick with the echo tests, they are standard in this industry: they benchmark receiving performance (parsing and memory management) as well as sending performance (framing and memory management). Everything else is up to the user; it's not part of the websocket library. Node.js, Apache, h2o, NGINX and all those HTTP servers measure performance in requests per second, aka echo, simply because that is the only way to show (without tainting the server with user code) the performance of the server and only the server.
For reference, this is the result I get with the server you wrote in uWS:
clients: 1000 95per-rtt: 25ms min-rtt: 7ms median-rtt: 15ms max-rtt: 26ms
clients: 2000 95per-rtt: 41ms min-rtt: 10ms median-rtt: 32ms max-rtt: 44ms
clients: 3000 95per-rtt: 56ms min-rtt: 14ms median-rtt: 47ms max-rtt: 59ms
clients: 4000 95per-rtt: 72ms min-rtt: 19ms median-rtt: 62ms max-rtt: 76ms
clients: 5000 95per-rtt: 87ms min-rtt: 22ms median-rtt: 80ms max-rtt: 99ms
clients: 6000 95per-rtt: 106ms min-rtt: 25ms median-rtt: 96ms max-rtt: 111ms
clients: 7000 95per-rtt: 125ms min-rtt: 29ms median-rtt: 113ms max-rtt: 132ms
clients: 8000 95per-rtt: 139ms min-rtt: 33ms median-rtt: 129ms max-rtt: 144ms
clients: 9000 95per-rtt: 158ms min-rtt: 37ms median-rtt: 145ms max-rtt: 176ms
clients: 10000 95per-rtt: 182ms min-rtt: 48ms median-rtt: 164ms max-rtt: 189ms
clients: 11000 95per-rtt: 203ms min-rtt: 49ms median-rtt: 185ms max-rtt: 214ms
clients: 12000 95per-rtt: 217ms min-rtt: 49ms median-rtt: 200ms max-rtt: 225ms
clients: 13000 95per-rtt: 240ms min-rtt: 53ms median-rtt: 217ms max-rtt: 252ms
clients: 14000 95per-rtt: 257ms min-rtt: 57ms median-rtt: 234ms max-rtt: 263ms
clients: 15000 95per-rtt: 266ms min-rtt: 74ms median-rtt: 253ms max-rtt: 271ms
clients: 16000 95per-rtt: 282ms min-rtt: 69ms median-rtt: 269ms max-rtt: 285ms
clients: 17000 95per-rtt: 300ms min-rtt: 72ms median-rtt: 288ms max-rtt: 361ms
clients: 18000 95per-rtt: 316ms min-rtt: 88ms median-rtt: 306ms max-rtt: 323ms
clients: 19000 95per-rtt: 331ms min-rtt: 84ms median-rtt: 323ms max-rtt: 336ms
clients: 20000 95per-rtt: 349ms min-rtt: 80ms median-rtt: 341ms max-rtt: 353ms
clients: 21000 95per-rtt: 366ms min-rtt: 91ms median-rtt: 357ms max-rtt: 369ms
clients: 22000 95per-rtt: 386ms min-rtt: 93ms median-rtt: 375ms max-rtt: 388ms
clients: 23000 95per-rtt: 396ms min-rtt: 111ms median-rtt: 391ms max-rtt: 406ms
clients: 24000 95per-rtt: 416ms min-rtt: 98ms median-rtt: 408ms max-rtt: 429ms
clients: 25000 95per-rtt: 436ms min-rtt: 104ms median-rtt: 428ms max-rtt: 537ms
clients: 26000 95per-rtt: 453ms min-rtt: 107ms median-rtt: 446ms max-rtt: 454ms
clients: 27000 95per-rtt: 473ms min-rtt: 112ms median-rtt: 465ms max-rtt: 479ms
clients: 28000 95per-rtt: 487ms min-rtt: 117ms median-rtt: 480ms max-rtt: 492ms
As you can see, the difference is major. Yet the very same websocket library has been utilized. I hope this will get you to realize how flawed this benchmark is.
This is yet again validating my very first post "Chapter 6".
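To put a number on the gap, here is a small sketch that divides the median RTT of the original server by the median RTT of the rewritten one at a few client counts. The values are copied from the two result listings above; `Row`, `slowdown`, and `sampleRows` are names made up for this illustration.

```cpp
#include <cassert>
#include <vector>

// One client count with the median RTT (ms) from each listing above.
struct Row {
    int clients;
    int medianRewritten; // server based on the batched broadcast
    int medianOriginal;  // server from the benchmark repository
};

// How many times slower the original server's median RTT is.
double slowdown(const Row &r) {
    assert(r.medianRewritten > 0);
    return static_cast<double>(r.medianOriginal) / r.medianRewritten;
}

// Four rows picked from the listings: 1000, 10000, 20000, and 28000 clients.
std::vector<Row> sampleRows() {
    return {{1000, 7, 15}, {10000, 42, 164}, {20000, 80, 341}, {28000, 115, 480}};
}
```

On these four rows the median ratio comes out at roughly 2.1x for 1000 clients and roughly 3.9x to 4.3x from 10000 clients upward.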
Can you share the code for this?
Yes I can post it, but it would be very unfair if you used it since the other servers would be using a different broadcasting algorithm.
This is what I have currently, it depends on a new function which is not fully decided on yet, but should land some time soon (I have discussed this function for a while with other people doing pub/sub):
#include <uWS/uWS.h>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// One pending broadcast: the raw message and the socket that sent it.
struct Sender {
    std::string data;
    uWS::WebSocket<uWS::SERVER> ws;
};

std::vector<Sender> senders;
uWS::Hub hub;
bool newThisIteration, inBatch;

int main(int argc, char *argv[]) {
    uv_timer_t timer;
    uv_timer_init(hub.getLoop(), &timer);

    // Runs before the loop polls for I/O. While a batch is open, arm a
    // 1 ms one-shot timer so the loop wakes up again even without traffic.
    uv_prepare_t prepare;
    prepare.data = &timer;
    uv_prepare_init(hub.getLoop(), &prepare);
    uv_prepare_start(&prepare, [](uv_prepare_t *prepare) {
        if (inBatch) {
            uv_timer_start((uv_timer_t *) prepare->data, [](uv_timer_t *t) {}, 1, 0);
            newThisIteration = false;
        }
    });

    // Runs after the loop has polled for I/O. Flush the batch once an
    // iteration passes without any new broadcast request.
    uv_check_t checker;
    uv_check_init(hub.getLoop(), &checker);
    uv_check_start(&checker, [](uv_check_t *checker) {
        if (inBatch && !newThisIteration) {
            std::vector<std::string> messages;
            std::vector<int> excludes;
            for (const Sender &s : senders) {
                messages.push_back(s.data);
            }

            if (messages.size()) {
                // Frame the whole batch once, then send the same prepared buffer to every socket.
                uWS::WebSocket<uWS::SERVER>::PreparedMessage *prepared = uWS::WebSocket<uWS::SERVER>::prepareMessageBatch(messages, excludes, uWS::OpCode::BINARY, false, nullptr);
                hub.getDefaultGroup<uWS::SERVER>().forEach([&prepared](uWS::WebSocket<uWS::SERVER> ws) {
                    ws.sendPrepared(prepared, nullptr);
                });
                uWS::WebSocket<uWS::SERVER>::finalizeMessage(prepared);
            }

            // Send each original sender its message back with the first byte set to 'r'.
            for (Sender &s : senders) {
                s.data[0] = 'r';
                s.ws.send(s.data.data(), s.data.length(), uWS::OpCode::BINARY);
            }

            senders.clear();
            inBatch = false;
        }
    });

    hub.onMessage([](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        switch (message[0]) {
        case 'b': // broadcast request: queue it for the next flush
            senders.push_back({std::string(message, length), ws});
            newThisIteration = true;
            inBatch = true;
            break;
        case 'e': // echo request: reply immediately
            ws.send(message, length, opCode);
        }
    });

    hub.listen(3000);
    hub.run();
}
I landed the initial commit here: https://github.com/uWebSockets/uWebSockets/commit/e4b7584b20ee6d359355aac35b1174697f7e3987
I love the fact that you've put together a nice set of socket implementations in various languages (especially Elixir!).
I would very much like to see a more optimized version of the Node implementation, though. If it took advantage of inline caching and V8's Crankshaft optimizer, I think it could do dramatically better.
Most.js does an amazing job at that: https://github.com/cujojs/most/tree/master/test/perf
Chapter 1: You are benchmarking the client, not the server
Let's look at the client you are using to "benchmark" these servers:
Golden rule of benchmarking: benchmark the server, NOT the client. You are benchmarking a high performance C++ server with a low performance golang client. Every JSON (this whole JSON-tainting story is a chapter of its own) receive server side will result in many receives client side. In fact, if you look at µWS as an example, the only thing happening user-side of the server is:
So what are you benchmarking server side here? Well, you are benchmarking the receive of one WebSocket frame followed by one JSON parse and one WebSocket frame formatting -> the rest is 100% the operating system (aka, there is no theoretical way to make it any more efficient)
Now, let's look at the client side: since you are using a low performance golang client with a full WebSocket implementation, every broadcast will result in thousands of WebSocket frame parsings client side. Are you starting to get what I'm pointing at now? You are benchmarking 1 WebSocket frame parsing + 1 JSON parse server side followed by thousands of WebSocket frame parsings client side, and you are parsing these in golang!
I immediately saw a HUGE tainting factor client-side when I started benchmarking WebSocket servers. So what did I do about it? I wrote the client in low-level TCP in C++ and made sure the server was stressed 100% all the time. This dramatically increased the gap between the slow WebSocket servers and the fast ones (as you can see in my benchmark, WebSocket++ is many tens of times faster than ws).
If you are going to act like you are benchmarking a high performance server, you had better write a client that is capable of outperforming the server; otherwise you are not benchmarking anything other than the client. No matter how many client instances you have, it still makes a massive difference whether you have many slow clients in a cluster or one ultra fast one. You are completely tainting any kind of result by having this client.
Chapter 2: the broadcasting benchmark in general
You told me that you were not able to see any difference when doing an echo test, so instead you made this broadcast test. That statement alone solidifies my criticism: your client is so slow that it doesn't make any difference if you have a fast server or a slow one, while in my tests I can see dramatic differences in server performance, even when only dealing with 1 single echo message! I can see a 6x difference in performance between ws and µWS with 1 single echo message, and up to 150x when doing thousands of echoes per TCP chunk. But my point is not the 150x, my point is that it is absolutely possible to showcase a massive difference in performance when doing simple echo benchmarks! But like I said: it requires that your client is able to stress the server, and that means you cannot possibly write it in golang with the standard golang bullshit WebSocket client implementation.
Chapter 3: the JSON tainting
As you have already heard, benchmarking 1 WebSocket frame parse together with 1 JSON parse, where the JSON parse is by far the dominant cost, is simply unacceptable. And you pass this off as a WebSocket benchmark! Parsing JSON is extremely slow compared to parsing a WebSocket frame: every single byte of the JSON data has to be scanned for the matching end token (if you are inside a string, it has to check EVERY BYTE for the end token). Compare this to the WebSocket format, where the length of the whole message is given in the header, which makes the parsing O(1) while the JSON parsing is AT LEAST O(n).
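To illustrate the length-prefix point, here is a minimal sketch of decoding the fixed part of a WebSocket frame header as laid out in RFC 6455. It is not taken from any of the libraries discussed here; `FrameHeader` and `parseFrameHeader` are names chosen for this example. The number of bytes inspected is bounded by the header size (at most 14), regardless of payload length. Note that this only covers locating the frame boundary: a server still has to unmask a client's payload byte by byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Decoded fixed part of an RFC 6455 frame header (hypothetical type).
struct FrameHeader {
    bool fin;
    std::uint8_t opcode;
    bool masked;
    std::uint64_t payloadLength;
    std::size_t headerSize; // bytes before the payload, mask key included
};

// Returns the header, or nullopt if more bytes are needed to decode it.
std::optional<FrameHeader> parseFrameHeader(const unsigned char *buf, std::size_t available) {
    assert(buf != nullptr);
    if (available < 2) return std::nullopt;

    FrameHeader h{};
    h.fin = (buf[0] & 0x80) != 0;
    h.opcode = buf[0] & 0x0F;
    h.masked = (buf[1] & 0x80) != 0;

    std::uint64_t len = buf[1] & 0x7F;
    std::size_t offset = 2;
    if (len == 126) {            // 16-bit extended length, big-endian
        if (available < 4) return std::nullopt;
        len = (static_cast<std::uint64_t>(buf[2]) << 8) | buf[3];
        offset = 4;
    } else if (len == 127) {     // 64-bit extended length, big-endian
        if (available < 10) return std::nullopt;
        len = 0;
        for (int i = 0; i < 8; i++) len = (len << 8) | buf[2 + i];
        offset = 10;
    }
    if (h.masked) {              // 4-byte masking key follows the length
        if (available < offset + 4) return std::nullopt;
        offset += 4;
    }
    h.payloadLength = len;
    h.headerSize = offset;
    return h;
}
```

Once the header is decoded, the next frame starts at `headerSize + payloadLength`, so no scanning of the payload is needed to find it. The sketch uses `std::optional`, so it assumes C++17.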
Chapter 4: the threading and other random variance
Some servers are threaded, some are not. Some servers are implemented with hash tables, some are not. Some servers have RapidJSON, some have other JSON implementations. You simply have WAY too many variables going all random to give any kind of stable result. Comparing a server utilizing 8 CPU cores with a server restricted to 1 is just mindblowingly invalid. It's not just a bunch of "threads" you can toss in to get a speed-up; you also need to take into account the efficient and the inefficient ways of using threading. That varies with implementation.
Chapter 5: gold comments
Chapter 6: low-level primitives vs. high level algorithms
A WebSocket library exposes some very fundamental and low-level functionalities that you as an app developer can use to construct more complex algorithms, like for instance, efficient broadcasting. What this benchmark is trying to simulate is very close to a pub/sub server: you get an event and you push this to all the connected sockets.
Now, as you might know, broadcasting can be implemented with a simple for-loop and a call to the WebSocket send function. This is what you are doing in this "benchmark". The problem is that this kind of algorithm for distributing 100 events to X connections is very far from efficient, and it does not reflect the underlying low-level library as much as it reflects your own abstract interpretation of "pub/sub".
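As a rough cost model of that for-loop approach, the sketch below counts how many send calls it makes. `CountingSocket` is a stand-in invented for this example, not a type from any WebSocket library; a real send would also frame the message and hand it to the kernel on every call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a connection that only records what it was asked to send.
struct CountingSocket {
    std::size_t sends = 0;
    std::size_t bytes = 0;
    void send(const std::string &message) {
        ++sends;
        bytes += message.size();
    }
};

// Naive broadcast: one send call per event per connection.
std::size_t naiveBroadcast(std::vector<CountingSocket> &sockets,
                           const std::vector<std::string> &events) {
    std::size_t calls = 0;
    for (const std::string &event : events) {
        for (CountingSocket &socket : sockets) {
            socket.send(event);
            ++calls;
        }
    }
    // The call count is always events x connections.
    assert(calls == events.size() * sockets.size());
    return calls;
}
```

With 100 events and 1000 connections this makes 100000 send calls, each of which repeats the same framing work for an identical payload.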
As an example, I work for a company where pub/sub is part of the problem to optimize. This pub/sub was implemented with a for-loop and calls to send for each socket. I changed this into a far more efficient algorithm that merges the broadcasts and prepares the WebSocket frames for these in an efficient way. This resulted in a 270x speed-up and far outperforms the most common pub/sub implementations out there. Had I used a slow server as the low-level implementation, then this speed-up would not be even remotely close to possible. Yet, it still required me to design the algorithm efficiently.
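The general idea of merging broadcasts can be sketched as follows: frame every pending message once, concatenate the frames into one buffer, and write that same buffer to each connection. This is only an illustration of the technique under simple assumptions (unfragmented, unmasked server frames per RFC 6455, connections modelled as write callbacks); it is not the implementation described above, and `buildServerFrame`, `RawWrite`, and `broadcastPrepared` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Build one unmasked server-to-client frame with FIN set.
std::string buildServerFrame(const std::string &payload, std::uint8_t opcode) {
    assert(opcode <= 0x0F);
    std::string frame;
    frame.reserve(payload.size() + 10);
    frame.push_back(static_cast<char>(0x80 | opcode));
    std::uint64_t len = payload.size();
    if (len < 126) {
        frame.push_back(static_cast<char>(len));
    } else if (len <= 0xFFFF) {
        frame.push_back(static_cast<char>(126));
        frame.push_back(static_cast<char>(len >> 8));
        frame.push_back(static_cast<char>(len & 0xFF));
    } else {
        frame.push_back(static_cast<char>(127));
        for (int shift = 56; shift >= 0; shift -= 8)
            frame.push_back(static_cast<char>((len >> shift) & 0xFF));
    }
    frame += payload;
    return frame;
}

// Hypothetical connection: a callback that writes raw bytes to one socket.
using RawWrite = std::function<void(const std::string &)>;

// Frame all messages once, then issue a single write per connection.
// Returns the size of the shared buffer in bytes.
std::size_t broadcastPrepared(const std::vector<std::string> &messages, std::uint8_t opcode,
                              const std::vector<RawWrite> &connections) {
    std::string batch;
    for (const std::string &message : messages)
        batch += buildServerFrame(message, opcode);
    for (const RawWrite &write : connections)
        write(batch);
    return batch.size();
}
```

In this sketch the framing cost is paid once per message instead of once per message per connection, and each connection receives one write for the whole batch rather than one per event.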
My point is, you cannot benchmark the low-level fundamentals of a library by benchmarking your own inefficient for-loop that pretty much just calls into the kernel and leaves no room for the user-space server to shine.
End notes
This benchmark is completely flawed and does not in any way show the real personalities of the underlying WebSocket servers. I know for a fact that WebSocket++ far outperforms most other servers and that needs to be properly displayed here. The point of a good benchmark is to maximize the result difference between the test subjects. You want to show difference in terms of X not in terms of minor percentages.