jonhoo / volley

Volley is a benchmarking tool for measuring the performance of server networking stacks.
MIT License

channel-based thread worker-pool #19

Closed xekoukou closed 9 years ago

xekoukou commented 9 years ago

The server workflow is this one:

  1. Creates nclients threads to accept the connections and perform the initial verification of each connection.
  2. Creates ncores threads to process the network traffic.
  3. Each worker thread processes one client until that client has finished sending data, and then informs the main thread that it is free to process a new client.

On my old laptop, this halves the mean latency. Because the threads work per client rather than interleaving the requests of all clients, the stddev is very high.

Both the number of cores and the number of clients need to be known: the number of cores in order to create that many worker threads, and the number of clients in order to complete the initialization phase without the listener blocking.

The processing is done per client because that is the only way to avoid blocking the threads without using mio or another polling mechanism.
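For illustration, a minimal sketch of the worker side of this design (the function names, channel layout, and 8-byte n -> n+1 framing are assumptions, not necessarily the actual code in this PR):

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

// One worker thread: it owns a channel of accepted streams and tells the
// main thread (via `done`) when it is free to take another client.
fn spawn_worker(id: usize, streams: Receiver<TcpStream>, done: Sender<usize>) {
    thread::spawn(move || {
        for mut stream in streams {
            // Process this client to completion before taking the next one.
            handle_client(&mut stream);
            // Report back so the main thread knows this worker is free again.
            let _ = done.send(id);
        }
    });
}

// Assumed n -> n+1 protocol: read an 8-byte integer, reply with n + 1,
// until the client closes the connection.
fn handle_client(stream: &mut TcpStream) {
    let mut buf = [0u8; 8];
    while stream.read_exact(&mut buf).is_ok() {
        let n = u64::from_le_bytes(buf);
        if stream.write_all(&(n + 1).to_le_bytes()).is_err() {
            break;
        }
    }
}
```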

To run it, execute:

./rust-channels -p 2222 -c ncores -cl nclients

The number of clients needs to equal the number of client threads the client process creates.

I managed to halve the mean latency.

xekoukou commented 9 years ago

It would be nice to see the results on an 80-core system.

jonhoo commented 9 years ago

As pointed out in #16, I think it should be considered cheating for the server to know in advance how many clients there will be. "Real" servers do not know this number, although they might know some sensible upper bound on it. The server should instead spawn some number of threads, and either increase the size of the pool if it detects contention, or make clients wait, or multiplex clients on a single thread (though this requires async I/O).

Knowledge of the number of server cores is fine, though it would be better to use something like num_cpus() rather than forcing it to be specified on the command-line.
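For example, the core count could be detected at runtime with the num_cpus crate instead of a command-line flag (a minimal sketch):

```rust
// Cargo.toml: num_cpus = "1"
fn main() {
    // Size the worker pool from the detected core count instead of a -c flag.
    let ncores = num_cpus::get();
    println!("spawning {} worker threads", ncores);
}
```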

xekoukou commented 9 years ago

This has nothing to do with that. That is why I tried to make it clear. It has to do with the initialization phase.

jonhoo commented 9 years ago

Ah, sorry, I misread. However, it should still not be necessary for the server to know the number of clients. Instead, it should spawn a number of accept threads equal to the number of cores.
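A sketch of how the accept threads could share a single listener using std alone (the port and core count here are placeholders):

```rust
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:2222")?;
    let ncores = 4; // or num_cpus::get()

    // One accept thread per core, each with its own clone of the listener.
    let accepters: Vec<_> = (0..ncores)
        .map(|_| {
            let listener = listener.try_clone().expect("clone listener");
            thread::spawn(move || {
                for stream in listener.incoming() {
                    if let Ok(stream) = stream {
                        // hand the accepted stream off to a worker here
                        drop(stream);
                    }
                }
            })
        })
        .collect();

    for a in accepters {
        a.join().unwrap();
    }
    Ok(())
}
```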

xekoukou commented 9 years ago

Yes, it could, but the client waits until all connections are created and then starts the benchmark, so this wouldn't work unless you changed the client code.

jonhoo commented 9 years ago

The client also forces each client connection to do one round of the n -> n+1 protocol before it starts, so I believe your server would still not work correctly (i.e. the client would hang forever because not all connections have replied yet).

While I could change the test client to not wait for any connections to be established, that would mean setup cost per client is also taken into account when benchmarking. This could potentially change the results dramatically, since the way servers choose to do accepts would then also matter quite a lot for a large number of clients. I'm not sure that is a worthwhile change.

One way to get around this would be to do the following:

  1. Spawn ncores threads, each one calling accept on the same socket
  2. Spawn another ncores threads, each with its own channel (like you do now)
  3. After accept returns in the (1) threads, the thread does a single n -> n+1 handshake on the new stream
  4. The stream is then pushed to the appropriate channel
  5. Each of the (2) threads loops forever, taking streams off its channel and performing all the handshakes it receives on each stream

In particular, because of (3), the client will happily assume all clients have connected, but you will still process a single client at a time.
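Roughly, that flow could look like the sketch below (the 8-byte n -> n+1 framing, the round-robin routing, and the names are assumptions; the routing-via-main-thread variant discussed later would replace step (4)):

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::sync::mpsc::{channel, Sender};
use std::thread;

// One n -> n+1 round (assumed 8-byte little-endian framing).
fn one_round(stream: &mut TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 8];
    stream.read_exact(&mut buf)?;
    let n = u64::from_le_bytes(buf);
    stream.write_all(&(n + 1).to_le_bytes())
}

fn main() -> std::io::Result<()> {
    let ncores = 4; // or num_cpus::get()
    let listener = TcpListener::bind("0.0.0.0:2222")?;

    // (2) one worker thread per core, each with its own stream channel
    let mut workers: Vec<Sender<TcpStream>> = Vec::new();
    for _ in 0..ncores {
        let (tx, rx) = channel::<TcpStream>();
        workers.push(tx);
        thread::spawn(move || {
            // (5) take streams off the channel and serve each to completion
            for mut stream in rx {
                while one_round(&mut stream).is_ok() {}
            }
        });
    }

    // (1) accept threads, one per core, all sharing the same listener
    let accepters: Vec<_> = (0..ncores)
        .map(|i| {
            let listener = listener.try_clone().expect("clone listener");
            let workers = workers.clone();
            thread::spawn(move || {
                let mut next = i;
                for stream in listener.incoming() {
                    let mut stream = match stream {
                        Ok(s) => s,
                        Err(_) => continue,
                    };
                    // (3) one handshake so the benchmark client counts this
                    //     connection as established
                    if one_round(&mut stream).is_err() {
                        continue;
                    }
                    // (4) push the stream to a worker channel (round-robin
                    //     here; the issue routes via the main thread to a
                    //     free worker instead)
                    if workers[next % workers.len()].send(stream).is_err() {
                        return;
                    }
                    next += 1;
                }
            })
        })
        .collect();

    for a in accepters {
        a.join().unwrap();
    }
    Ok(())
}
```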

jonhoo commented 9 years ago

On a separate note, the code is fairly verbose with a lot of repeated blocks. It would be great if you could tidy it up a bit so it is easier to read.

xekoukou commented 9 years ago

What you say could be done, but it increases complexity for nothing. I would also have to pass the TcpStream back to the main thread to be used by the (2) threads.

Do you really want to do that? It doesn't change anything.

xekoukou commented 9 years ago

By the way, regarding how it works: check the initialize_connection function. It is exactly the same as handle_client, just without the loop.

jonhoo commented 9 years ago

It's not for nothing -- it's to make the server realistic. Servers cannot be expected to know how many clients there will be. I don't see why you need to pass the stream to the main thread. Keep a channel of TcpStreams for each (2) thread, and have each (1) thread send the streams it accepts to one of these channels.

jonhoo commented 9 years ago

With regards to initialize_connection, that's exactly my point. It would be better if everything that is within the loop was extracted to its own function. That way, both handle_client and initialize_connection could just call that, avoiding the code duplication.
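For example, the shared round could live in one helper that both functions call (a sketch; the actual function bodies and framing may differ):

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

// The single n -> n+1 exchange shared by both code paths
// (assumed 8-byte little-endian framing).
fn one_exchange(stream: &mut TcpStream) -> std::io::Result<()> {
    let mut buf = [0u8; 8];
    stream.read_exact(&mut buf)?;
    let n = u64::from_le_bytes(buf);
    stream.write_all(&(n + 1).to_le_bytes())
}

// initialize_connection: exactly one round, as the benchmark client expects.
fn initialize_connection(stream: &mut TcpStream) {
    let _ = one_exchange(stream);
}

// handle_client: the same round, repeated until the client disconnects.
fn handle_client(stream: &mut TcpStream) {
    while one_exchange(stream).is_ok() {}
}
```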

xekoukou commented 9 years ago

The pattern I am using is that the worker threads inform the main thread when they finish processing. So a stream is not sent to a randomly picked thread; it is only sent to a worker that is free to do the work.
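That main-thread coordination might look roughly like this (a sketch; the channel names and setup are assumptions):

```rust
use std::net::TcpStream;
use std::sync::mpsc::{Receiver, Sender};

// Main-thread dispatch loop: `pending` carries freshly accepted streams,
// `done` carries the index of each worker as it finishes a client, and
// `workers[i]` is worker i's own stream channel.
fn dispatch(
    pending: Receiver<TcpStream>,
    done: Receiver<usize>,
    workers: Vec<Sender<TcpStream>>,
) {
    // Initially, every worker is free.
    let mut free: Vec<usize> = (0..workers.len()).collect();
    for stream in pending {
        // Wait for a free worker if none is available right now.
        let id = match free.pop() {
            Some(id) => id,
            None => done.recv().expect("all workers exited"),
        };
        workers[id].send(stream).expect("worker hung up");
        // Drain any other completion notices that arrived in the meantime.
        while let Ok(id) = done.try_recv() {
            free.push(id);
        }
    }
}
```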

I will make the changes so that we don't need ncores or nclients, in order to be compatible with your framework.

One other problem I am seeing is that, because the stddev is very high, the benchmark is performed multiple times.

jonhoo commented 9 years ago

Ah, I see. Sure, then send them through the main thread. I'm not sure that will actually make much difference compared to sending them to different threads directly in round-robin order.

Great, thank you. As mentioned, it is mostly about trying to make the benchmarks realistic. This is also the reason why the numbers sent to the server are random, and not static. If they were static, a server could cheat by e.g. pregenerating the response, potentially lowering latency that way.

It is true that this will bring stddev up, which is unfortunate, but it is also accurate. The way you're handling clients, some clients will have to wait much longer than others, and the higher stddev shows that. Runtimes still shouldn't be too long due to the 3 second time limit in client/main.c though.

xekoukou commented 9 years ago

I made the changes. I did it in a hurry but it seems to work.

jonhoo commented 9 years ago

Thanks! I'll take a look first thing tomorrow.

jonhoo commented 9 years ago

This is starting to look good, though I'd still like you to factor out the core of handle_client and initialize_connection to avoid the code duplication.

xekoukou commented 9 years ago

Jon, feel free to change the code as you like. I don't have any more time. Given that its mean latency is half that of the other servers, this is a good server for showing that thread switching can account for half of the latency.

jonhoo commented 9 years ago

Continued in #22.