pistacheio / pistache

A high-performance REST toolkit written in C++
https://pistacheio.github.io/pistache/
Apache License 2.0
3.14k stars · 691 forks

Comparison with Golang #994

Open ionkrutov opened 2 years ago

ionkrutov commented 2 years ago

I was curious how much faster a C++ server would be than a Go server, and I was a little surprised. I wrote equivalent code in both languages and tested it with Apache Benchmark: ab -n 2000000 -c 1000 -k http://localhost:9595/

The golang code looks like this:


package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Hello World!\n"))
	})

	log.Fatal(http.ListenAndServe("localhost:9595", nil))
}

On C++:

#include <pistache/endpoint.h>
#include <iostream>

using namespace Pistache;

class HelloHandler : public Http::Handler {
public:
    HTTP_PROTOTYPE(HelloHandler)

    void onRequest(const Http::Request& request, Http::ResponseWriter response) override {
        response.send(Http::Code::Ok, "Hello World!\n");
    }
};

int main() {
    std::cout << "Server listening on port 9595" << std::endl;
    Address addr(Ipv4::any(), Port(9595));
    auto opts = Http::Endpoint::options().threads(6);
    Http::Endpoint server(addr);
    server.init(opts);
    server.setHandler(Http::make_handler<HelloHandler>());
    server.serve();
}
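
For reference, here is one way to build the C++ example with optimizations enabled (the pkg-config name `libpistache` and the source file name are assumptions; they may differ depending on how Pistache was installed):

```shell
# Compile with -O2; an unoptimized build can easily lose a benchmark like this.
g++ -O2 -std=c++17 hello.cc $(pkg-config --cflags --libs libpistache) -lpthread -o hello
```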

ab gave the following results:

- C++: Requests per second: 55138.30 [#/sec] (mean)
- Golang: Requests per second: 58193.58 [#/sec] (mean)

I ran the test several times, and Golang was always slightly faster than C++.

What am I doing wrong? How should I properly use multithreading in Pistache? Why is the C++ version slower?

Thank you in advance.

dennisjenkins75 commented 2 years ago

Interesting.

  1. How many CPU cores are on your server?
  2. Did you run 'ab' from the same machine? How many cores was it using?
  3. Might be interesting to run "perf top" on the same machine as pistache and see where the CPU time is going.
  4. Might be interesting to run pistache under gprof (yes, it will run much slower...) and see where the CPU time is going.
ionkrutov commented 2 years ago

Interesting.

Undoubtedly. :-)

  1. How many CPU cores are on your server?

On my AMD Ryzen 5 4600H laptop with 6 cores and 12 threads. (As the argument to Http::Endpoint::options().threads(n) I tried 1, 12, and 255 as well; the results were about the same.)

  2. Did you run 'ab' from the same machine? How many cores was it using?

Yes, in both cases (Golang and C++) I started the server locally and ran ab on the same host.

  3. & 4. I will answer a little later.

ionkrutov commented 2 years ago

Might be interesting to run "perf top" on the same machine as pistache and see where the CPU time is going.

C++ server: (screenshot)

Golang server: (screenshot)

It seems that the C++ server isn't doing any work at all.

dennisjenkins75 commented 2 years ago

Please tell "perf top" to only look at the pistache process, and try to take the screenshot with the "ab" window not covering up the interesting bits of the perf top output.

ionkrutov commented 2 years ago

Will this do?

(screenshot)

dennisjenkins75 commented 2 years ago

Thank you for the report and update. I do not have time to dig deeper at the moment, but hope to within a few days.

Go implements a different threading model than C++17, so that might account for part of it. The hotspot in Pistache appears to be heap allocations. Linking against tcmalloc might yield a small improvement.

kiplingw commented 2 years ago

With recent GCCs, parallel execution policies are implemented with Intel's Threading Building Blocks (TBB). I'm not sure what Go uses.

dennisjenkins75 commented 2 years ago

Ivan: Which compiler are you using, and if it's GCC, can you also run a test where you build Pistache with a recent Clang?

Kip: Do we have any options for automatic performance testing, similar to our CI?

kiplingw commented 2 years ago

I don't think so. But I suspect @Tachi107 could have fun with that with the new skills he's been learning lately.

dennisjenkins75 commented 2 years ago

I attempted to conduct my own performance testing against my Pistache application. However, most of my overhead was in getting/releasing connections to PostgreSQL from my connection pooler, and in logging HTTP requests to the same database. I did not use "ab" (Apache Benchmark); instead I used "hey" (https://github.com/rakyll/hey); it's functionally identical.

I'll need to compile Ivan's examples and tinker.

Suggestion: we either create a "benchmark" directory in pistacheio/pistache, add Ivan's examples (and possibly one in Rust), and code up some sort of little "run it locally" benchmark suite, or create a GitHub project like "pistacheio/benchmarks" and place them there. It would be nice to have a set of sample servers that all return identical results for "HTTP GET /" (one using Pistache, others using other tech), and some framework or scripts for testing them all.

dennisjenkins75 commented 2 years ago

I've modified "examples/hello_server.cc" to use 10 threads. I ran hey (as follows) and got the following results on an AMD 5950X (16 cores, 32 threads, 64 GiB RAM). My system was not idle, though; it had a background load of ~2 when the benchmark was not running.

I'm not familiar with Meson, so I don't know off the top of my head how to compile it with gprof enabled (-g -pg -no-pie).
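
One plausible way to pass those gprof flags through Meson (the build directory name is arbitrary; `cpp_args`/`cpp_link_args` are standard Meson options):

```shell
# Configure a separate build dir with gprof instrumentation, then build.
meson setup build-prof -Dcpp_args='-g -pg -fno-pie' -Dcpp_link_args='-pg -no-pie'
ninja -C build-prof
```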

$ hey -z 20s -c 100 -cpus 10  http://127.0.0.1:9080/ 

Summary:
  Total:    20.0024 secs
  Slowest:  0.1282 secs
  Fastest:  0.0001 secs
  Average:  0.0053 secs
  Requests/sec: 18947.2669

  Total data:   4547892 bytes
  Size/request: 12 bytes

Response time histogram:
  0.000 [1] |
  0.013 [328294]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.026 [26007] |■■■
  0.039 [14244] |■■
  0.051 [6635]  |■
  0.064 [2611]  |
  0.077 [821]   |
  0.090 [223]   |
  0.103 [104]   |
  0.115 [32]    |
  0.128 [19]    |

Latency distribution:
  10% in 0.0004 secs
  25% in 0.0006 secs
  50% in 0.0010 secs
  75% in 0.0020 secs
  90% in 0.0184 secs
  95% in 0.0299 secs
  99% in 0.0514 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0011 secs, 0.0001 secs, 0.1282 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:    0.0013 secs, 0.0000 secs, 0.0933 secs
  resp wait:    0.0006 secs, 0.0000 secs, 0.0686 secs
  resp read:    0.0023 secs, 0.0000 secs, 0.1109 secs

Status code distribution:
  [200] 378991 responses

dennisjenkins75 commented 2 years ago

I wonder whether the Meson build uses -O0, -O2, or some other optimization level.
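
This is easy to check: Meson defaults to the `debug` buildtype (-O0 -g) unless told otherwise. Assuming a build directory named `build`:

```shell
# Print the current configuration, including buildtype:
meson configure build

# Switch to an optimized build (-O3, assertions off) and rebuild:
meson configure build -Dbuildtype=release
ninja -C build
```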

dennisjenkins75 commented 2 years ago

I should install "ab" and test "ab" vs. "hey" with identical configs and an identical HTTP server, to see whether "hey" performs the same as "ab" or not.
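
For such a comparison, the invocations should be matched as closely as the tools allow (port and counts here are illustrative):

```shell
# ab runs a fixed request count; hey runs for a fixed duration.
# Both at 100 concurrent connections against the same server.
ab  -n 1000000 -c 100 -k http://127.0.0.1:9080/
hey -z 20s     -c 100    http://127.0.0.1:9080/
```

Note that ab only reuses connections when given -k, while hey keeps connections alive by default (disable with -disable-keepalive), so keep-alive behavior needs to be matched or the comparison is apples to oranges.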

ionkrutov commented 2 years ago

@dennisjenkins75

Ivan: Which compiler are you using, and if its GCC< can you also run a test where you build pistache with a recent clang?

Yesterday I used clang version 10.0.0-4ubuntu1, and today, after your question, I compiled with g++ 9.3.0. The results are roughly the same.

Or do I need clang version 12?

ionkrutov commented 2 years ago

I also decided to try using hey:

CPP_SERVER:

▶./hey_linux_amd64 -z 20s  -c 100 -cpus 10 http://127.0.0.1:9595

Summary:
  Total:    20.0075 secs
  Slowest:  0.2453 secs
  Fastest:  0.0002 secs
  Average:  0.0060 secs
  Requests/sec: 16578.0475

  Total data:   4311918 bytes
  Size/request: 13 bytes

Response time histogram:
  0.000 [1] |
  0.025 [307081]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.049 [16753] |■■
  0.074 [5835]  |■
  0.098 [1486]  |
  0.123 [350]   |
  0.147 [99]    |
  0.172 [43]    |
  0.196 [26]    |
  0.221 [10]    |
  0.245 [2] |

Latency distribution:
  10% in 0.0008 secs
  25% in 0.0012 secs
  50% in 0.0019 secs
  75% in 0.0033 secs
  90% in 0.0144 secs
  95% in 0.0351 secs
  99% in 0.0641 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0017 secs, 0.0001 secs, 0.1927 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:    0.0009 secs, 0.0000 secs, 0.1945 secs
  resp wait:    0.0016 secs, 0.0000 secs, 0.0790 secs
  resp read:    0.0017 secs, 0.0000 secs, 0.1273 secs

Status code distribution:
  [200] 331686 responses

GOLANG_SERVER:

▶./hey_linux_amd64 -z 20s  -c 100 -cpus 10 http://127.0.0.1:9595

Summary:
  Total:    20.0018 secs
  Slowest:  0.0610 secs
  Fastest:  0.0001 secs
  Average:  0.0020 secs
  Requests/sec: 96421.6241

  Total data:   25071878 bytes
  Size/request: 25 bytes

Response time histogram:
  0.000 [1] |
  0.006 [993014]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.012 [6571]  |
  0.018 [353]   |
  0.024 [12]    |
  0.031 [29]    |
  0.037 [9] |
  0.043 [8] |
  0.049 [0] |
  0.055 [0] |
  0.061 [3] |

Latency distribution:
  10% in 0.0002 secs
  25% in 0.0003 secs
  50% in 0.0007 secs
  75% in 0.0014 secs
  90% in 0.0023 secs
  95% in 0.0031 secs
  99% in 0.0055 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0000 secs, 0.0018 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0584 secs
  resp wait:    0.0013 secs, 0.0000 secs, 0.0388 secs
  resp read:    0.0004 secs, 0.0000 secs, 0.0292 secs

Status code distribution:
  [200] 1000000 responses

~16,000 rps vs ~96,000 rps.

e-dant commented 2 years ago

Let's do some causal profiling!

https://github.com/plasma-umass/coz

These modern network-based projects are exactly where causal profilers succeed and traditional profilers fall a bit short.

We'll need to set up some breakpoints in the pistache library. I'm not familiar enough with pistache's internals to tinker with it. Happy to help if you give me some pointers to the "important parts." Someone mentioned heap allocations? Any file or class in particular?
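
For the record, the basic coz workflow looks like this (the binary name is a placeholder; the progress point goes wherever a response finishes, e.g. right after response.send()):

```shell
# In the source, mark a progress point with the COZ_PROGRESS macro
# from <coz.h>, then run the server under coz:
coz run --- ./hello_server

# Drive load with ab/hey while it runs; coz writes profile.coz,
# which can be plotted with the coz viewer.
```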