hashrocket / websocket-shootout

A comparison of websocket servers in multiple languages and frameworks

Criticism of the benchmarks #44

Open · ghost opened this issue 8 years ago

ghost commented 8 years ago

Chapter 1: You are benchmarking the client, not the server

Let's look at the client you are using to "benchmark" these servers.

Golden rule of benchmarking: benchmark the server, NOT the client. You are benchmarking a high performance C++ server with a low performance golang client. Every JSON broadcast the server receives (this whole JSON-tainting story is a chapter of its own) results in many receives on the client side. In fact, if you look at µWS as an example, the only thing happening user-side of the server is:
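(A rough sketch, not the repo's exact code: it assumes the uWS 0.x-style API that appears later in this thread and elides the JSON parsing.)

#include <uWS/uWS.h>

int main() {
    uWS::Hub hub;

    // The entire user-side broadcast path: one frame arrives, its JSON payload
    // is parsed once, and the message is then sent to every connected socket.
    hub.onMessage([&hub](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        // parse the JSON payload here (JSON library elided) ...
        hub.getDefaultGroup<uWS::SERVER>().forEach([&](uWS::WebSocket<uWS::SERVER> other) {
            other.send(message, length, opCode);
        });
    });

    hub.listen(3000);
    hub.run();
}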

So what are you benchmarking server side here? Well, you are benchmarking the receive of one WebSocket frame, followed by one JSON parse and one WebSocket frame formatting; the rest is 100% the operating system (i.e., there is no theoretical way to make it any more efficient).

Now, let's look at the client side: since you are using a low performance golang client with a full WebSocket implementation, every broadcast will result in thousands of WebSocket frame parsings client side. Are you starting to see what I'm pointing at? You are benchmarking 1 WebSocket frame parse + 1 JSON parse server side, followed by thousands of WebSocket frame parsings client side, and you are doing those parses in golang!

I immediately saw a HUGE tainting factor client-side when I started benchmarking WebSocket servers. So what did I do about it? I wrote the client in low-level TCP in C++ and made sure the server was stressed 100% of the time. This dramatically increased the gap between the slow WebSocket servers and the fast ones (as you can see in my benchmark, WebSocket++ is many tens of times faster than ws).

If you are going to act like you are benchmarking a high performance server, you had better write a client that is capable of outperforming the server; otherwise you are not benchmarking anything other than the client. No matter how many client instances you have, it still makes a massive difference whether you have many slow clients in a cluster or one ultra-fast client. You are completely tainting any kind of result by using this client.

Chapter 2: the broadcasting benchmark in general

You told me that you were not able to see any difference when doing an echo test, so instead you made this broadcast test. That statement alone solidifies my criticism: your client is so slow that it makes no difference whether you have a fast server or a slow one, while in my tests I can see dramatic differences in server performance even when dealing with one single echo message. I can see a 6x difference in performance between ws and µWS with a single echo message, and up to 150x when doing thousands of echoes per TCP chunk. But my point is not the 150x; my point is that it is absolutely possible to showcase a massive difference in performance with simple echo benchmarks. Like I said, though: it requires that your client is able to stress the server, and that means you cannot possibly write it in golang with the standard golang bullshit WebSocket client implementation.

Chapter 3: the JSON tainting

Like you have already heard, the fact that you benchmark 1 WebSocket frame parsing together with 1 JSON parsing, where the JSON parsing is majorly dominant, is simply unacceptable. And you pass this off as a WebSocket benchmark! Parsing JSON is extremely slow compared to parsing a WebSocket frame: every single byte of the JSON data has to be scanned for a matching end token (if you are inside a string, it has to check EVERY BYTE for the end token). Compare this to the WebSocket format, where the length of the whole message is given in the header, which makes the frame parsing O(1) while the JSON parsing is AT LEAST O(n).
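(To illustrate the O(1) point, a sketch of the length decoding only, per RFC 6455; masking, fragmentation and validation are left out. The payload length comes straight out of the first few header bytes:)

#include <cstdint>

// Read the payload length of a WebSocket frame from its header. The 7-bit
// length field either holds the length itself or selects a 16-bit / 64-bit
// extended length: a constant amount of work regardless of payload size.
uint64_t payloadLength(const unsigned char *header) {
    uint64_t len = header[1] & 0x7F;     // low 7 bits of the second header byte
    if (len == 126) {                    // 16-bit extended length follows
        len = ((uint64_t) header[2] << 8) | header[3];
    } else if (len == 127) {             // 64-bit extended length follows
        len = 0;
        for (int i = 0; i < 8; i++) {
            len = (len << 8) | header[2 + i];
        }
    }
    return len;
}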

Chapter 4: the threading and other random variance

Some servers are threaded, some are not. Some servers are implemented with hash tables, some are not. Some servers use RapidJSON, some use other JSON implementations. You simply have WAY too many variables varying at random to get any kind of stable result. Comparing a server utilizing 8 CPU cores with a server restricted to 1 is just mind-blowingly invalid. It's not just a bunch of "threads" you can toss in and get a speed-up; you also need to take into account the efficient and the inefficient ways of using threading, and that varies with implementation.

Chapter 5: gold comments

Chapter 6: low-level primitives vs. high level algorithms

A WebSocket library exposes some very fundamental, low-level functionality that you as an app developer can use to construct more complex algorithms, for instance efficient broadcasting. What this benchmark is trying to simulate is very close to a pub/sub server: you get an event and you push it to all the connected sockets.

Now, as you might know, broadcasting can be implemented with a simple for-loop and a call to the WebSocket send function for each socket (as in the sketch under Chapter 1). This is what you are doing in this "benchmark". The problem is that this kind of algorithm for distributing 100 events to X connections is very far from efficient, and it does not reflect the underlying low-level library so much as it reflects your own abstract interpretation of "pub/sub".

As an example, I work for a company where pub/sub is part of the problem to optimize. That pub/sub was implemented with a for-loop and a call to send for each socket. I changed this into a far more efficient algorithm that merges the broadcasts and prepares the WebSocket frames for them up front; the core idea is sketched below. This resulted in a 270x speed-up and far outperforms the most common pub/sub implementations out there. Had I used a slow server as the low-level implementation, this speed-up would not have been even remotely possible. Yet it still required me to design the algorithm efficiently.
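(A library-agnostic sketch of that idea, purely illustrative: build the WebSocket frame for the message once, then push the same bytes to every socket, instead of re-framing the payload per send.)

#include <unistd.h>
#include <string>
#include <vector>

// Frame a server-to-client binary message once (unmasked, single final frame,
// payloads up to 64 KB for brevity), then write the identical bytes to each fd.
void broadcastPrepared(const std::string &payload, const std::vector<int> &fds) {
    std::string frame;
    frame.push_back((char) 0x82);                    // FIN + binary opcode
    if (payload.size() < 126) {
        frame.push_back((char) payload.size());      // 7-bit length
    } else {
        frame.push_back((char) 126);                 // 16-bit extended length
        frame.push_back((char) (payload.size() >> 8));
        frame.push_back((char) (payload.size() & 0xFF));
    }
    frame += payload;

    // The framing cost is paid once; each socket only costs one write syscall.
    for (int fd : fds) {
        write(fd, frame.data(), frame.size());
    }
}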

My point is, you cannot benchmark the low-level fundamentals of a library by benchmarking your own inefficient for-loop, which pretty much just calls into the kernel and leaves no room for the user-space server to shine.

End notes

This benchmark is completely flawed and does not in any way show the real personalities of the underlying WebSocket servers. I know for a fact that WebSocket++ far outperforms most other servers, and that needs to be properly displayed here. The point of a good benchmark is to maximize the result difference between the test subjects: you want to show differences in multiples, not in minor percentages.

jackc commented 7 years ago

I've done further testing in response to your assertions.

First, I added a binary mode to the websocketpp, uwebsocket, and Go (golang.org/x/net/websocket) servers and updated the benchmark client to support it. Performance of all servers increased roughly in proportion to the message size reduction (the same message encoded in binary is about 75% the size of the message encoded as JSON) but no more. As you mention, a single broadcast from the server will cause thousands of JSON decodes in the client, but this test indicates that, because multiple client machines were used in parallel, that was not a limiting factor.

Next, I tested using C++ instead of Go for the benchmark client. In addition, the C++ benchmark does not use a full websocket implementation; it works directly with libuv. I used your throughput benchmark as the starting point and updated it to run the same test as the Go tool's binary broadcast test. It was indeed able to get higher results than a single Go client, but running the Go tool on multiple machines in parallel produces higher results still.

None of the above tests found the dramatic differences your assertions would lead me to expect. However, I think I found something that explains the substantially different results we observe. I noticed the benchmarks for uwebsockets are hard-coded to connect to 127.0.0.1. This could confound the results in two ways. First, the client and server are running on the same machine, so any resources taken by the benchmark client have a direct negative effect on the server. This explains getting substantially different results from a very low overhead C++ client versus a more heavyweight Go client. Second, using the loopback interface instead of an actual network involves far less overhead, which allows much higher numbers than are possible on a real network.

The point of a good benchmark is to maximize the result difference between the test subjects.

I do not see the fact that most implementations are within 50% of each other as a flaw; I see it as a valid data point showing that, for this particular workload, the choice of language and library should probably not be decided on throughput alone. For other workloads, results may be substantially different.

The raw results are here: https://github.com/hashrocket/websocket-shootout/blob/master/results/round-02-binary.md. The C++ benchmark is here: https://github.com/hashrocket/websocket-shootout/tree/master/cpp/bench.

ghost commented 7 years ago

To validate Chapter 6 of my first post, and to really show you how flawed your "benchmark of websocket libraries" is, I made my own server with uWS, and it performs several hundred percent better than the one you wrote (using the very same uWS):

clients: 1000    95per-rtt: 7ms    min-rtt: 4ms    median-rtt: 7ms    max-rtt: 7ms
clients: 2000    95per-rtt: 15ms    min-rtt: 8ms    median-rtt: 11ms    max-rtt: 18ms
clients: 3000    95per-rtt: 19ms    min-rtt: 12ms    median-rtt: 14ms    max-rtt: 25ms
clients: 4000    95per-rtt: 22ms    min-rtt: 16ms    median-rtt: 19ms    max-rtt: 27ms
clients: 5000    95per-rtt: 31ms    min-rtt: 20ms    median-rtt: 23ms    max-rtt: 36ms
clients: 6000    95per-rtt: 37ms    min-rtt: 23ms    median-rtt: 27ms    max-rtt: 39ms
clients: 7000    95per-rtt: 36ms    min-rtt: 26ms    median-rtt: 29ms    max-rtt: 40ms
clients: 8000    95per-rtt: 41ms    min-rtt: 30ms    median-rtt: 33ms    max-rtt: 45ms
clients: 9000    95per-rtt: 44ms    min-rtt: 34ms    median-rtt: 37ms    max-rtt: 49ms
clients: 10000    95per-rtt: 50ms    min-rtt: 38ms    median-rtt: 42ms    max-rtt: 50ms
clients: 11000    95per-rtt: 54ms    min-rtt: 42ms    median-rtt: 45ms    max-rtt: 59ms
clients: 12000    95per-rtt: 59ms    min-rtt: 46ms    median-rtt: 49ms    max-rtt: 61ms
clients: 13000    95per-rtt: 63ms    min-rtt: 50ms    median-rtt: 53ms    max-rtt: 64ms
clients: 14000    95per-rtt: 65ms    min-rtt: 55ms    median-rtt: 57ms    max-rtt: 68ms
clients: 15000    95per-rtt: 73ms    min-rtt: 58ms    median-rtt: 61ms    max-rtt: 75ms
clients: 16000    95per-rtt: 78ms    min-rtt: 62ms    median-rtt: 65ms    max-rtt: 83ms
clients: 17000    95per-rtt: 89ms    min-rtt: 66ms    median-rtt: 69ms    max-rtt: 145ms
clients: 18000    95per-rtt: 91ms    min-rtt: 69ms    median-rtt: 73ms    max-rtt: 95ms
clients: 19000    95per-rtt: 90ms    min-rtt: 73ms    median-rtt: 77ms    max-rtt: 93ms
clients: 20000    95per-rtt: 94ms    min-rtt: 77ms    median-rtt: 80ms    max-rtt: 95ms
clients: 21000    95per-rtt: 98ms    min-rtt: 81ms    median-rtt: 86ms    max-rtt: 103ms
clients: 22000    95per-rtt: 101ms    min-rtt: 86ms    median-rtt: 89ms    max-rtt: 103ms
clients: 23000    95per-rtt: 105ms    min-rtt: 89ms    median-rtt: 93ms    max-rtt: 105ms
clients: 24000    95per-rtt: 105ms    min-rtt: 94ms    median-rtt: 97ms    max-rtt: 109ms
clients: 25000    95per-rtt: 130ms    min-rtt: 97ms    median-rtt: 103ms    max-rtt: 202ms
clients: 26000    95per-rtt: 115ms    min-rtt: 102ms    median-rtt: 106ms    max-rtt: 116ms
clients: 27000    95per-rtt: 123ms    min-rtt: 104ms    median-rtt: 112ms    max-rtt: 125ms
clients: 28000    95per-rtt: 131ms    min-rtt: 110ms    median-rtt: 115ms    max-rtt: 134ms

Just like Chapter 6 states, a broadcast is ultimately going to end up being a loop of syscalls (a constant workload for all servers). That's why it is important to know what you are doing when implementing things like pub/sub (and this very benchmark of yours). You cannot use your grandmother as a test subject when measuring how fast a sports car is and then conclude, based on the fact that your grandmother didn't go any faster, that "all cars are the same speed". What you benchmark in that case is your grandmother, not the car.

By implementing a very simple server based on my own recommendations from this repo: https://github.com/alexhultman/High-performance-pub-sub, I was able to produce results on your own benchmark close to 5x better than those you came up with.

You need to stop tainting the benchmark with your own shortcomings. You cannot conclude that uWS is "about the same" as other low-performance implementations when the issue is what you put on top of the library. A server will not magically be fast just because you swapped to uWS; it requires that you know how to use it and the surrounding low-level machinery.

Stick with the echo tests; they are standard in this industry. They benchmark receiving performance (parsing + memory management) as well as sending performance (framing and memory management). Everything else is up to the user and is not part of the websocket library. Node.js, Apache, h2o, NGINX and all the other HTTP servers measure performance in requests per second, i.e. echo, simply because that is the only way to show the performance of the server and only the server, without tainting it with user code.
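(An echo server in this style is tiny; as a hedged sketch, assuming the same uWS 0.x-style API used elsewhere in this thread:)

#include <uWS/uWS.h>

int main() {
    uWS::Hub hub;

    // Echo every frame straight back: the measured work is exactly one frame
    // parse on receive and one frame write on send, with no user code between.
    hub.onMessage([](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        ws.send(message, length, opCode);
    });

    hub.listen(3000);
    hub.run();
}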

For reference, this is the result I get with the server you wrote in uWS:

clients: 1000    95per-rtt: 25ms    min-rtt: 7ms    median-rtt: 15ms    max-rtt: 26ms
clients: 2000    95per-rtt: 41ms    min-rtt: 10ms    median-rtt: 32ms    max-rtt: 44ms
clients: 3000    95per-rtt: 56ms    min-rtt: 14ms    median-rtt: 47ms    max-rtt: 59ms
clients: 4000    95per-rtt: 72ms    min-rtt: 19ms    median-rtt: 62ms    max-rtt: 76ms
clients: 5000    95per-rtt: 87ms    min-rtt: 22ms    median-rtt: 80ms    max-rtt: 99ms
clients: 6000    95per-rtt: 106ms    min-rtt: 25ms    median-rtt: 96ms    max-rtt: 111ms
clients: 7000    95per-rtt: 125ms    min-rtt: 29ms    median-rtt: 113ms    max-rtt: 132ms
clients: 8000    95per-rtt: 139ms    min-rtt: 33ms    median-rtt: 129ms    max-rtt: 144ms
clients: 9000    95per-rtt: 158ms    min-rtt: 37ms    median-rtt: 145ms    max-rtt: 176ms
clients: 10000    95per-rtt: 182ms    min-rtt: 48ms    median-rtt: 164ms    max-rtt: 189ms
clients: 11000    95per-rtt: 203ms    min-rtt: 49ms    median-rtt: 185ms    max-rtt: 214ms
clients: 12000    95per-rtt: 217ms    min-rtt: 49ms    median-rtt: 200ms    max-rtt: 225ms
clients: 13000    95per-rtt: 240ms    min-rtt: 53ms    median-rtt: 217ms    max-rtt: 252ms
clients: 14000    95per-rtt: 257ms    min-rtt: 57ms    median-rtt: 234ms    max-rtt: 263ms
clients: 15000    95per-rtt: 266ms    min-rtt: 74ms    median-rtt: 253ms    max-rtt: 271ms
clients: 16000    95per-rtt: 282ms    min-rtt: 69ms    median-rtt: 269ms    max-rtt: 285ms
clients: 17000    95per-rtt: 300ms    min-rtt: 72ms    median-rtt: 288ms    max-rtt: 361ms
clients: 18000    95per-rtt: 316ms    min-rtt: 88ms    median-rtt: 306ms    max-rtt: 323ms
clients: 19000    95per-rtt: 331ms    min-rtt: 84ms    median-rtt: 323ms    max-rtt: 336ms
clients: 20000    95per-rtt: 349ms    min-rtt: 80ms    median-rtt: 341ms    max-rtt: 353ms
clients: 21000    95per-rtt: 366ms    min-rtt: 91ms    median-rtt: 357ms    max-rtt: 369ms
clients: 22000    95per-rtt: 386ms    min-rtt: 93ms    median-rtt: 375ms    max-rtt: 388ms
clients: 23000    95per-rtt: 396ms    min-rtt: 111ms    median-rtt: 391ms    max-rtt: 406ms
clients: 24000    95per-rtt: 416ms    min-rtt: 98ms    median-rtt: 408ms    max-rtt: 429ms
clients: 25000    95per-rtt: 436ms    min-rtt: 104ms    median-rtt: 428ms    max-rtt: 537ms
clients: 26000    95per-rtt: 453ms    min-rtt: 107ms    median-rtt: 446ms    max-rtt: 454ms
clients: 27000    95per-rtt: 473ms    min-rtt: 112ms    median-rtt: 465ms    max-rtt: 479ms
clients: 28000    95per-rtt: 487ms    min-rtt: 117ms    median-rtt: 480ms    max-rtt: 492ms

As you can see, the difference is major, yet the very same websocket library has been used. I hope this gets you to realize how flawed this benchmark is.

This yet again validates Chapter 6 of my very first post.

jackc commented 7 years ago

Can you share the code for this?

ghost commented 7 years ago

Yes, I can post it, but it would be very unfair if you used it, since the other servers would be using a different broadcasting algorithm.

This is what I have currently. It depends on a new function that is not fully decided on yet but should land some time soon (I have discussed this function for a while with other people doing pub/sub):

#include <uWS/uWS.h>
#include <iostream>
#include <string>
using namespace std;

// One queued broadcast: the message payload and the socket that sent it.
struct Sender {
    std::string data;
    uWS::WebSocket<uWS::SERVER> ws;
};

std::vector<Sender> senders;
uWS::Hub hub;
bool newThisIteration, inBatch; // batching state (globals, zero-initialized to false)

int main(int argc, char *argv[]) {

    // Dummy 1 ms timer: armed while a batch is pending so the poll phase wakes
    // up quickly and the check handler below gets a chance to flush the batch.
    uv_timer_t timer;
    uv_timer_init(hub.getLoop(), &timer);

    // Prepare handler: runs just before the loop polls for I/O. While a batch is
    // pending it arms the short timer and clears the per-iteration flag, so the
    // check handler can tell whether any new broadcasts arrived this iteration.
    uv_prepare_t prepare;
    prepare.data = &timer;
    uv_prepare_init(hub.getLoop(), &prepare);
    uv_prepare_start(&prepare, [](uv_prepare_t *prepare) {
        if (inBatch) {
            uv_timer_start((uv_timer_t *) prepare->data, [](uv_timer_t *t) {}, 1, 0);
            newThisIteration = false;
        }
    });

    // Check handler: runs after this iteration's I/O callbacks. Once an iteration
    // passes with no new broadcasts, all queued messages are framed once with
    // prepareMessageBatch, that shared prepared frame is sent to every socket,
    // and each sender is then acknowledged with an 'r' reply.
    uv_check_t checker;
    uv_check_init(hub.getLoop(), &checker);
    uv_check_start(&checker, [](uv_check_t *checker) {
        if (inBatch && !newThisIteration) {
            std::vector<std::string> messages;
            std::vector<int> excludes;
            for (Sender s : senders) {
                messages.push_back(s.data);
            }

            if (messages.size()) {
                uWS::WebSocket<uWS::SERVER>::PreparedMessage *prepared = uWS::WebSocket<uWS::SERVER>::prepareMessageBatch(messages, excludes, uWS::OpCode::BINARY, false, nullptr);
                hub.getDefaultGroup<uWS::SERVER>().forEach([&prepared](uWS::WebSocket<uWS::SERVER> ws) {
                    ws.sendPrepared(prepared, nullptr);
                });
                uWS::WebSocket<uWS::SERVER>::finalizeMessage(prepared);
            }

            for (Sender s : senders) {
                s.data[0] = 'r';
                s.ws.send(s.data.data(), s.data.length(), uWS::OpCode::BINARY);
            }

            senders.clear();
            inBatch = false;
        }
    });

    // 'b' frames are broadcast requests and get queued for batching;
    // 'e' frames are plain echoes.
    hub.onMessage([](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        switch (message[0]) {
        case 'b':
            senders.push_back({std::string(message, length), ws});
            newThisIteration = true;
            inBatch = true;
            break;
        case 'e':
            ws.send(message, length, opCode);
        }
    });

    hub.listen(3000);
    hub.run();
}

I landed the initial commit here: https://github.com/uWebSockets/uWebSockets/commit/e4b7584b20ee6d359355aac35b1174697f7e3987

DarkMarmot commented 7 years ago

I love the fact that you've put together a nice set of socket implementations in various languages (especially Elixir!).

I would very much like to see a more optimized version of the Node implementation, though. If it took advantage of inline caching and V8's Crankshaft optimizer, I think it could do dramatically better.

Most.js does an amazing job at that: https://github.com/cujojs/most/tree/master/test/perf

ghost commented 7 years ago

Good write-up. I also wonder why the ws websocket library was used instead of uWS, when uWS performs far better. That's not fair to Node.js or to the author, and I think the blog chart should be updated with uWS numbers.