hyperium / tonic

A native gRPC client & server implementation with async/await support.
https://docs.rs/tonic

Benchmark comparison with grpc-rs #211

Open nicholasbishop opened 4 years ago

nicholasbishop commented 4 years ago

Background: I've been working on an implementation of the remote execution API using tonic for both clients and the server. I've had some trouble getting good performance out of it, so I started looking into grpc-rs to compare speed.

Unfortunately, converting my whole project to grpc-rs for a full comparison would be quite a bit of work, and anyway the tonic API+docs seem much friendlier (not to mention the convenience of using modern futures with async/await). So instead I put together a small benchmark of a very simple gRPC service:

https://github.com/nicholasbishop/grpc-bench-rs

The benchmark does show grpc-rs performing quite a bit better, as well as supporting a larger number of client connections. (Following the discussion in https://github.com/hyperium/tonic/issues/209 I added a simple retry loop so that failed requests from the tonic client would be retried, but the loop doesn't seem to terminate in a reasonable time once the number of attempted connections gets too high.)
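
For reference, here is a minimal sketch of the kind of retry loop described above (purely illustrative, not the benchmark's actual code; `EchoClient`, `EchoRequest`, and `EchoReply` are hypothetical stand-ins for the generated service types, and the sleep uses the current tokio API):

    // Retry a unary call until it succeeds or a cap on attempts is hit.
    // All service types here are hypothetical placeholders.
    async fn call_with_retry(
        client: &mut EchoClient<tonic::transport::Channel>,
        req: EchoRequest,
        max_attempts: usize,
    ) -> Result<tonic::Response<EchoReply>, tonic::Status> {
        let mut attempts = 1;
        loop {
            match client.echo(tonic::Request::new(req.clone())).await {
                Ok(resp) => return Ok(resp),
                Err(_) if attempts < max_attempts => {
                    attempts += 1;
                    // Back off briefly so a refused connection can recover.
                    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
                }
                Err(status) => return Err(status),
            }
        }
    }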

I'd be interested in any feedback on the benchmark itself -- have I made a mistake somewhere that could be hurting tonic's performance numbers?

I used the Linux perf tool to do some basic profiling. The items above 5% are:

  Children      Self  Command          Shared Object       Symbol
+   45.04%    35.69%  tokio-runtime-w  tonic_server        [.] <bytes::bytes_mut::BytesMut as bytes::buf::buf_mut::BufMut>::put_slice
+   32.74%    23.53%  tokio-runtime-w  libc-2.30.so        [.] __memmove_avx_unaligned_erms
+   24.64%    24.04%  tokio-runtime-w  tonic_server        [.] <tonic::codec::prost::ProstEncoder<T> as tokio_util::codec::encoder::Encoder>::encode
LucioFranco commented 4 years ago

Thanks for running these benches! I've been meaning to write some, but it doesn't make too much sense right now because we are waiting on a new release of prost that should allow us to avoid an extra copy, which I believe is what the put_slice is showing. https://github.com/hyperium/tonic/blob/master/tonic/src/codec/prost.rs#L54
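
To illustrate why an intermediate copy shows up as `put_slice` time, here is a simplified sketch (assumed for illustration, not tonic's actual codec) contrasting encoding through a temporary buffer with encoding directly into the destination:

    use bytes::{BufMut, BytesMut};
    use prost::Message;

    // Copying path: encode into a temporary Vec, then copy the bytes into
    // the destination buffer. The second step is the kind of memmove that
    // shows up under BufMut::put_slice in a profile.
    fn encode_with_copy<M: Message>(msg: &M, dst: &mut BytesMut) {
        let tmp = msg.encode_to_vec();
        dst.put_slice(&tmp);
    }

    // Direct path: reserve space once and let prost write straight into
    // the destination buffer, avoiding the intermediate copy.
    fn encode_in_place<M: Message>(msg: &M, dst: &mut BytesMut) -> Result<(), prost::EncodeError> {
        dst.reserve(msg.encoded_len());
        msg.encode(dst)
    }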

Tonic-wise there is a decent amount of work left to optimize things, and the same goes for h2. Mostly, this is just the first attempt at a pure-Rust HTTP/2 gRPC implementation, so I hope that once the stack starts to stabilize a bit we can put some good work into optimizing things.

nicholasbishop commented 4 years ago

Thanks for the info. I'll be sure to give the test another run once prost is updated.

LucioFranco commented 4 years ago

Great! I'd also be very happy to work on this with you; expanding the benchmarks and improving things is very much welcome here. I do plan on doing a bunch of that myself, just kind of waiting for the last few things to shake out.

abel-von commented 4 years ago

Hi, I also did some benchmarking to compare the performance of tonic and grpc-go. The service code is very simple, just returning an empty struct:

    // Handler from the generated service trait; the request is ignored
    // and an empty result is returned.
    async fn read_file(&self, _req: Request<PathOpts>) -> Result<Response<ReadFileResult>, Status> {
        Ok(Response::new(ReadFileResult {
            content: vec![],
            error_message: "".to_string(),
        }))
    }

The Go code:

    func (rs *RemoteServer) ReadFile(_ context.Context, opts *api.PathOpts) (*api.ReadFileResult, error) {
        return &api.ReadFileResult{
            Content:      []byte(""),
            ErrorMessage: "",
        }, nil
    }

The client sends requests from 500 goroutines:

    var wg sync.WaitGroup
    concurrentCount := 500
    for i := 0; i < concurrentCount; i++ {
        wg.Add(1)
        go func() {
            // Each goroutine issues 500 requests (250,000 total).
            for i := 0; i < 500; i++ {
                if _, err := exec.Command("ls", "-al").CombinedOutput(); err != nil {
                    fmt.Printf("failed to start remote command %v\n", err)
                    atomic.AddInt32(&failedCount, 1)
                } else {
                    atomic.AddInt32(&succeedCount, 1)
                }
            }
            wg.Done()
        }()
    }
    wg.Wait()

When sending to the tonic server, it takes about 8-9 seconds to finish, while sending to the Go server takes only 4-5 seconds. I thought this was because of the tokio runtime settings, so I adjusted the attribute to #[tokio::main(core_threads = 16, max_threads = 32)], but changing core_threads from 8 to 16 did not change the result.
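
For context, the same runtime configuration can be expressed through the builder API; a sketch using the tokio 0.2-era option names that match the attribute above:

    // Equivalent runtime setup via the builder (tokio 0.2-era API, matching
    // the core_threads/max_threads options used in the attribute).
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mut rt = tokio::runtime::Builder::new()
            .threaded_scheduler()
            .core_threads(16)
            .max_threads(32)
            .enable_all()
            .build()?;
        rt.block_on(async {
            // start the tonic server here
        });
        Ok(())
    }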

By the way, I also compared performance in a single-connection scenario, and there tonic performs better: the C++ implementation of gRPC that grpc-rs wraps has a bug which makes all requests on one connection be handled in a single thread, sequentially.

But I cannot explain tonic's performance degradation compared with grpc-go; maybe some code in tonic has to be modified. I have to do more investigation to find out where the problem is.

eranrund commented 4 years ago

@abel-von can you please elaborate on the grpc-rs bug that makes all requests be handled in a single thread?

zhucan commented 2 years ago

Is there a new benchmark comparison with grpc-rs? @LucioFranco

zhucan commented 2 years ago

https://github.com/LesnyRumcajs/grpc_bench/wiki/2022-01-11-bench-results @LucioFranco @sticnarf

microyahoo commented 2 years ago

It would be better to add continuous performance benchmarking, as gRPC does; see the following: https://www.grpc.io/docs/guides/benchmarking/ https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5180705743044608

wathenjiang commented 1 year ago

https://www.grpc.io/docs/guides/benchmarking/

I looked at the test results. Could there be something unfair about the test environment? That gRPC C++'s latency is worse than gRPC Java's is hard for me to believe.

wathenjiang commented 1 year ago

I believe the benchmark has had some unexpected changes, because the chart looks weird.

[benchmark chart image]