stepancheg / grpc-rust

Rust implementation of gRPC
MIT License
1.38k stars 124 forks

Improve performance #27

Open overvenus opened 7 years ago

overvenus commented 7 years ago

Hi @stepancheg, thank you for the excellent work.

I've found that the current performance is not very good (about 10x slower than Go). I know it is on the TODO list, but I wonder if there is any plan for optimizing performance.

Benchmark

cargo build --manifest-path=long-tests/with-rust/Cargo.toml --release

The streaming tests are skipped because they are incomplete.

Unary request: Echo

I have tweaked the Go client a little bit; my version runs in 40 goroutines.

The client sends 100000 echo requests.

time ./long_tests_client echo 100000
|          | Rust server | Go server |
|----------|-------------|-----------|
| time (s) | 26.732      | 2.902     |

FlameGraph

I also recorded a flame graph, hope it helps.

https://gist.github.com/overvenus/018e19ccc23555a7768e15774819f3af#file-kernel-7e2ec88-svg

Thank you! :)

siddontang commented 7 years ago

It seems that we only use one event loop. @overvenus, can we use multiple threads (one event loop per thread)?

stepancheg commented 7 years ago

@siddontang a multithreaded event loop with futures-mio-tokio is an area I haven't started exploring yet.

However, at this moment grpc-rust lacks a lot of simpler performance fixes; the most important ones are excessive memory allocation and memory copying.

overvenus commented 7 years ago

As for multithreading, mio does not support multiple threads accessing the same event loop concurrently. However, there is a simple workaround: SO_REUSEPORT. We can run multiple event loops listening on the same port. In fact, that is what tokio-proto does.

See more:

siddontang commented 7 years ago

@overvenus

If it's not hard to support SO_REUSEPORT, maybe we can send a PR.

/cc @stepancheg

siddontang commented 7 years ago

SO_REUSEPORT only works on Linux kernel 3.9+.

If we can't use SO_REUSEPORT, we can have every event loop share the same listening FD and use an accept lock, as nginx does, to avoid the thundering herd problem.

kanekv commented 6 years ago

@stepancheg - with tokio-core deprecated and the latest tokio supporting multithreaded event loops, are there plans to rewrite with the newest tokio?

repi commented 6 years ago

Looks like grpc-rust is still essentially single-threaded, processing one message/event at a time regardless of how many threads one creates with server.http.set_cpu_pool_threads. Is that still the case, or am I missing something?

Any plans here, or anything we can help out with, @stepancheg? We have some pretty heavy computation in our requests that absolutely needs to be processed on all cores in parallel.

stepancheg commented 6 years ago

@repi no, grpc-rust is fully concurrent.

If server.http.set_cpu_pool_threads is specified, the server callback is executed in a thread pool, which is useful for synchronous processing (e.g. synchronous I/O). However, the server should be fully concurrent even without the thread pool, if the input and output streams are used correctly.

I don't understand what the issue is; maybe there's a bug. I'd like to know more.

stepancheg commented 6 years ago

@Kane-Sendgrid I didn't look at the latest tokio; AFAIU it's unreleased, right?

repi commented 6 years ago

Thanks for confirming, @stepancheg. I did some more investigation, and the server event processing is indeed running in parallel, great!

Did some more testing in our (early) scenario, and it looks like it is on the client side that grpc-rust gives us no parallelism. We create a single Client::new_plain, make parallel calls on it to initiate RPCs (of the simple unary type), and then, after issuing all the calls, wait on them. In this scenario the RPCs appear to be sent to the server (also grpc-rust) serially and blocking, so the server only receives them one by one and can't process them in parallel from this client.

If we instead create a Client for each thread that issues RPCs in our app, then we don't have this serial bottleneck.

How is a client object intended to handle parallel calls?

stepancheg commented 6 years ago

@repi again, I'm not sure I understand.

The client is also fully concurrent: it can execute multiple concurrent requests, and it can be shared by multiple threads.

However, the client doesn't use a thread pool. In practice this means that if you supply a stream to a request, that stream is polled from the event loop, and if the request's data supplier blocks, the whole client blocks. But this is not an issue for unary calls.