private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

Simultaneous Connection Load #1022

Closed victorstewart closed 4 years ago

victorstewart commented 4 years ago

Do you have any data or intuition on how many simultaneous mostly-quiet connections Picoquic can tolerate? Will 100,000 cause it to fall over?

I'd need at least 50,000 (run as a containerized instance per logical core) to consider it for production workloads.

huitema commented 4 years ago

I did not test that, and the old adage says "if it is not tested it probably does not work". How would you write that test?

victorstewart commented 4 years ago

@huitema

Well, it would be all about monitoring the stability of the implementation's memory usage, throughput, and latency (and the absence of crashes) with an ever-growing connection count (maybe in increments of 1,000 connections).

Maybe a fixed volume of traffic per unit of time, spread over ever more connections. The stressed points would be the data structures that represent connections, and their read and write patterns, given the stable traffic volume. And given single-threaded usage, probably more the data structures themselves.

I assume packet loss and reordering rates depend on traffic volume, not on the number of open connections. So it might not be too complicated, since we don't need any intermediate chaos boxes.

huitema commented 4 years ago

I am not worried about the extra load for attributing a packet to a connection -- this is done with a hash table, and size 50,000 would not be too much of an issue. I am also not too worried about the network effects, since we use a single UDP socket for all connections, although there might be something happening in the kernel implementation of UDP and of the ARP caches. I am worried about memory size, since that was not optimized for. I am also worried about other details, such as the shutting down of connections if they are idle for too long, or the extra load of maintenance traffic to avoid the idle mechanism. That tells me that any test should not just create batches of connections, but also simulate what happens when these connections remain idle for at least 10 minutes.

From a testing point of view, I am worried about having 50000 clients. Will the local machine let me create 50000 sockets? What about handling 50000 separate client processes? It seems that a loopback test is difficult to engineer. On the other hand, I do not have the hardware available to simulate a network of clients.

Most tests are done with a simulated network using simulated time, bypassing the system network stack entirely. I could imagine a "test N clients" test written like that, but I am concerned that the same thread will handle both the clients and the server. So the first testing iterations will probably need to deal with the performance of the test itself.

Of course, if you do have a test infrastructure available, I will be happy to look at the results.

huitema commented 4 years ago

@victorstewart What is the traffic model for your applications? Some are easier than others. For example, if the transactions are initiated by the client, it is more efficient to just let the connection expire after an idle timer, and have the client set up another connection when a new transaction is needed. With 0-RTT, this can be done without any extra delay. This is what is done, for example, for DNS over QUIC. But if the server is pushing data to the client at random intervals, then the connection needs to stay up -- and also needs to deal with issues like NATs and firewalls.

victorstewart commented 4 years ago

I'm pretty sure any modern laptop can support 50,000 sockets across say 8 cores. And I have a toy setup with an Intel NUC running Clear Linux on my LAN that I use as a dev machine. Pretty sure that would do.

So I have 2 traffic models right now (with a 3rd web browser based one to come in the next year).

The 1st is a database (that I wrote) that maintains persistent connections with each server instance on the local machine, as well as a topology of other database instances running on other machines or in other datacenters (for replication purposes). Traffic flows both ways... server pushes updates to the database, and some of these get distributed to subscribers.

The 2nd is server <-> mobile app clients. These also maintain persistent long-lived connections to enable realtime data delivery from server to client, which is non-negotiable for my application. Theoretically these connections could be open and idle for hours, depending on the user, the hours of the day, the days of the week, etc. It's currently monitored with TCP Keep-Alive that is updated frequently and dynamically per client. Inevitably, though, clients go into elevators or down to subways and connections break.

I use a custom application specific protocol for each (on top of TCP and TLS)... because why waste all those roundtrips negotiating WebSockets lol.

That's the birds eye view.

victorstewart commented 4 years ago

I've been considering moving all the traffic over to QUIC, seeing that once you hardware-offload all the segmentation to the NIC and use just one UDP socket for all clients to enable better batching, throughput is equivalent to TCP.

But then I just watched this talk today https://www.youtube.com/watch?v=DS0-NWIL3Js and realized that some of these QUIC implementations are incredibly unstable (picoquic seeming to be the exception!). I was considering using Quiche but now am hesitant without similar performance stability data. And Quiche said they can't comment on performance metrics. Maybe they're great like pico but I'd have to test first now. Guess I just assumed a maturity that's still premature.

My gut feeling now is to just revisit this topic of transitioning from TCP to QUIC a year or two from now.

huitema commented 4 years ago

That's an interesting presentation. I am well aware that the throughput of picoquic is limited by the cost of socket IO -- see issue #78 for a series of tests on the subject. I have a PR in progress to address some of that (PR #1010). Contrary to what is said in the video, the main culprit is not the cost of data copies from app to kernel, but rather the high overhead of the socket calls -- interrupt cost in particular. On Windows, this might be addressed using asynchronous socket calls; on Linux, by using calls that pass multiple packets at a time.
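For illustration, a minimal sketch of the Linux-side option: handing several already-built UDP datagrams to the kernel in one sendmmsg() call instead of one sendto() per packet. The function name, batch size, and buffer layout are illustrative, not picoquic's actual send path.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>

/* Illustrative only: hand a batch of already-built UDP datagrams to the
 * kernel in one system call. Returns the number of datagrams accepted,
 * or -1 on error. */
int send_packet_batch(int fd, struct sockaddr_storage *dest, socklen_t dest_len,
                      char packet[][1452], size_t packet_len[], unsigned count)
{
    struct mmsghdr msgs[64];
    struct iovec iovs[64];

    if (count > 64) {
        count = 64;
    }
    memset(msgs, 0, sizeof(msgs));
    for (unsigned i = 0; i < count; i++) {
        iovs[i].iov_base = packet[i];
        iovs[i].iov_len = packet_len[i];
        msgs[i].msg_hdr.msg_name = dest;
        msgs[i].msg_hdr.msg_namelen = dest_len;
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* One syscall instead of `count` separate send calls. */
    return sendmmsg(fd, msgs, count, 0);
}
```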

victorstewart commented 4 years ago

@huitema

Well, on Linux, if you use io_uring, that eliminates the cost of interrupts and copies.

Then to minimize repetitive stack traversals... set the UDP_SEGMENT socket option to 1452 (MTU - sizeof(udphdr) - sizeof(ipv6hdr)) and instruct the QUIC implementation to construct packets up to 64KB (the max size of an IP packet).

As of kernel 4.5 you can just set this UDP_SEGMENT value to 1452 from the start and forget about it (before that you had to adjust it dynamically per sendmsg with a cmsg header).

Then when you sendmsg the packet into the kernel it will either get chopped up by generic segmentation offload in the kernel or forwarded to the NIC for GSO chopping (if available). Most recent NICs are good to go. And GSO Partial in the kernel handles a tail packet smaller than 1452.

And on the receive side, there's the concept of GRO (the inverse of GSO) to reconstruct the segmented packets into the original big packet.

But it's actually more performant to use a single UDP socket and benefit from batching.
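A minimal sketch of the socket-level setup being described: fixing UDP_SEGMENT once on the socket so that a large write gets split into 1452-byte datagrams by the kernel or the NIC. Error handling is trimmed; the 1452 figure is the one quoted above, and UDP_SEGMENT requires a reasonably recent kernel and glibc.

```c
#define _GNU_SOURCE
#include <netinet/udp.h>   /* SOL_UDP, UDP_SEGMENT (recent glibc) */
#include <sys/socket.h>

/* Fix the GSO segment size once, at socket setup time. Every subsequent
 * sendmsg() of a large buffer is then split into 1452-byte UDP datagrams
 * by the kernel, or by the NIC when hardware GSO is available.
 * (The alternative is to pass the segment size per call in a cmsg.) */
int enable_udp_gso(int fd)
{
    int gso_size = 1452; /* MTU minus UDP and IPv6 header overhead */
    return setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));
}
```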

huitema commented 4 years ago

You cannot use UDP segmentation in QUIC -- the spec says so. The main problem is what happens in case of packet loss: lose a single segment and you need to repeat the whole packet. That's why we are looking at variations of sendmmsg().

But for now, let's focus on writing a multiple connection test. That way, we will see what works and what needs improvement. The "test first" philosophy is the reason why in the video Picoquic was the only implementation that was not slowed down by out of order packets.

victorstewart commented 4 years ago

Segmentation beats sendmmsg, though, for every sub-64K batch of data.

The QUIC implementation creates the UDP packets (up to 1452 bytes of payload each), writes however many of them in sequence into a buffer (up to 64KB), and then passes this buffer with sendmsg. It then gets segmented along those 1452-byte boundaries.

Then beyond 64KB you can use sendmmsg for many of these buffers of packets (with pacing in mind).
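A rough sketch of that packing step, assuming UDP_SEGMENT has already been set on the socket as above. The function name and sizes are illustrative, not picoquic code; for GSO to split correctly, every packet except the last must be exactly the segment size.

```c
#include <string.h>
#include <sys/socket.h>

#define GSO_SEGMENT_SIZE 1452
#define GSO_BUFFER_SIZE  65536  /* upper bound of one IP datagram */

/* Illustrative only: pack consecutive QUIC packets into one buffer and hand
 * them to the kernel with a single sendmsg(). The kernel (or the NIC) then
 * splits the buffer along GSO_SEGMENT_SIZE boundaries; only the final,
 * tail packet may be shorter than the segment size. */
ssize_t send_packed_quic_packets(int fd, struct sockaddr *dest, socklen_t dest_len,
                                 const char *packets[], const size_t lengths[],
                                 size_t count)
{
    char buffer[GSO_BUFFER_SIZE];
    size_t offset = 0;
    struct iovec iov;
    struct msghdr msg;

    for (size_t i = 0; i < count && offset + lengths[i] <= sizeof(buffer); i++) {
        memcpy(buffer + offset, packets[i], lengths[i]);
        offset += lengths[i];
    }
    memset(&msg, 0, sizeof(msg));
    iov.iov_base = buffer;
    iov.iov_len = offset;
    msg.msg_name = dest;
    msg.msg_namelen = dest_len;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    return sendmsg(fd, &msg, 0);
}
```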

okay, how should we start?

huitema commented 4 years ago

Let's first start with the specific question, "after how many connections does a picoquic server stop working". Before that, I just did a quick measurement of the memory footprint of a connection, on the server side, on Windows, using VS 19.

| Object | Count | Bytes | Category | Notes |
| --- | --- | --- | --- | --- |
| cnx_t | 1 | 2864 | C-context | |
| path | 1 | 768 | C-context | |
| stream_head | 3 | 696 | C-context | Will grow and shrink as clients open and close streams |
| BBR State | 1 | 408 | C-context | |
| Char | 1 | 34 | C-context | |
| path * | 1 | 8 | C-context | |
| void | 36 | 19488 | Crypto | Mostly crypto. Some shared by multiple connections, e.g. 12,288 bytes for rand_pool_new |
| tls_ctx | 1 | 4592 | Crypto | TLS context, per connection. Not actually needed after the connection is established |
| aead_ctx | 2 | 160 | Crypto | |
| cipher_context | 2 | 80 | Crypto | |
| cnx_id_key | 8 | 384 | Hashes | |
| local cnxid | 8 | 384 | Hashes | |
| Hash item | 11 | 264 | Hashes | |
| net_icid_key | 1 | 160 | Hashes | |
| net_id_key | 1 | 152 | Hashes | |
| net_secret_key | 1 | 152 | Hashes | |
| H3 CB | 1 | 80 | Http-3 | |
| packets | 16 | 9840 | Packets | Pool of packets, shared by all connections |
| misc frame | 1 | 33 | Queued data | |
| **Total** | | 40547 | | |

That's actually rather encouraging. The 40K figure here is an upper bound. 50000 connections * 40K = 2GB, so a server should be able to allocate that. Besides, the 40K include at least 12K (the random number pool) that will be shared with other connections. That's before any optimization: for example, the TLS memory (4592 bytes, plus some of the "void") could be freed after the initial exchange is complete; the connection context could be reorganized so that data related to initial and handshake packets is freed when the handshake is complete. It should be possible to drop from 40K to 20K without doing anything heroic, maybe less if the connections are really silent.

huitema commented 4 years ago

Did the measurement of the allocation on a second connection -- just 26K, including 4.5K of HTTP-3 server stream context that was not captured in the previous snapshot because the stream was not started yet on that connection. So yes, I am fairly confident this could be brought down to under 20K, thus handling 50K connections with about 1GB.

victorstewart commented 4 years ago

@huitema

That's fantastic! And the HTTP-3 stream is unnecessary for my purposes as well, since I'd only ever use raw QUIC sockets.

How can I help with the next step?

huitema commented 4 years ago

I would add a test to the test library, to see what happens on the server. I would keep it simple first, and follow the design of the stress test: a single-threaded test operating in simulated time without touching the network stack. To simplify, I would use a single QUIC context for all clients. I would write a callback function for the client and the server, to implement a test application that generates sparse traffic.

Something like:

The simulation loop will:

There are multiple examples of such loops in the test library (picoquictest). This one should be very similar to the existing stress test, but with a specific traffic generator and an emphasis on a large number of connections. The goal is to measure that the server stays responsive, getting statistics on:

That design with a single client context will test the load on the server. It will not test the resistance to NAT/Firewall time-out per se. To do that, we would need a more complex design with one QUIC context and one simulated IP address per client, and simulate a stateful firewall in front of each client. The simulation loop would have to manage 50,000 objects instead of 5, and we would end up testing the simulator instead of testing the server. But we can test the firewall effect on scaling by setting the idle timeout per connection at about 30 seconds, and turning on the keep alive mechanism with a 10 seconds keep alive interval.

Like the stress test, this test should have run-time parameters: the desired simulated duration, the desired number of clients, maybe a traffic generation profile. By default, these should have low values adequate for a unit test, the goal being to verify that the simulation works. There should be a command line option in the test execution program to specify running the test with large parameter values.
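To make the shape of that loop concrete, here is a rough sketch of a simulated-time iteration. All helper names (`sim_*`, `connection_stress_ctx_t`) are hypothetical and do not match the actual picoquictest code; the real stress test in the test library is the reference.

```c
#include <stdint.h>

/* Hypothetical helpers -- illustrative only, not the picoquictest API. */
typedef struct connection_stress_ctx_t connection_stress_ctx_t;
uint64_t sim_next_arrival_time(connection_stress_ctx_t *ctx);
uint64_t sim_next_connection_wake_time(connection_stress_ctx_t *ctx);
int sim_deliver_pending_packets(connection_stress_ctx_t *ctx, uint64_t now);
int sim_prepare_next_packet(connection_stress_ctx_t *ctx, uint64_t now);
void sim_collect_statistics(connection_stress_ctx_t *ctx, uint64_t now);

int connection_stress_loop(connection_stress_ctx_t *ctx,
                           uint64_t simulated_duration, uint64_t *simulated_time)
{
    int ret = 0;

    while (ret == 0 && *simulated_time < simulated_duration) {
        /* Find the next event: either a packet waiting in a simulated link,
         * or the connection whose wake-up timer expires first. */
        uint64_t next_time = sim_next_arrival_time(ctx);
        uint64_t next_wake = sim_next_connection_wake_time(ctx);
        if (next_wake < next_time) {
            next_time = next_wake;
        }
        *simulated_time = next_time;

        /* Deliver any packet that has "arrived", then let the client or
         * server context that is due prepare its next packet. */
        ret = sim_deliver_pending_packets(ctx, *simulated_time);
        if (ret == 0) {
            ret = sim_prepare_next_packet(ctx, *simulated_time);
        }

        /* Record server responsiveness statistics at each step. */
        sim_collect_statistics(ctx, *simulated_time);
    }
    return ret;
}
```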

victorstewart commented 4 years ago

that is a very thorough and comprehensive analysis. I don't have any thoughts to add to that!

Do you have any intuition about how much more aggressively NATs and firewalls idle out UDP "connections" than TCP ones, and thus what a realistic minimum interval for sending keep-alive packets would be on the open Internet?

huitema commented 4 years ago

NATs time out TCP connections primarily on observing a TCP FIN or RST, so the TCP timeout can be somewhat large. For UDP, they need a shorter timeout, typically between 30 seconds and a couple of minutes; see for example RFC 4787. But it will sometimes be shorter, e.g., when a NAT running out of resources discards old bindings.
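As a minimal sketch of what that means for the application: arm the keep-alive with an interval below the common 30-second UDP binding timeout. This assumes picoquic's keep-alive helper takes the interval in microseconds; check picoquic.h for the exact signature before relying on it.

```c
#include "picoquic.h"

/* Keep the NAT/firewall binding alive on an otherwise idle connection.
 * Assumption: picoquic_enable_keep_alive() takes the interval in
 * microseconds -- verify against picoquic.h. A 25-second interval stays
 * under the 30-second UDP timeout mentioned above. */
static void arm_keep_alive(picoquic_cnx_t *cnx)
{
    const uint64_t keep_alive_interval_us = 25 * 1000000ull;
    picoquic_enable_keep_alive(cnx, keep_alive_interval_us);
}
```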

huitema commented 4 years ago

@victorstewart I don't have time to work on the "connection stress" before next week. Do you want to start a PR before that?

victorstewart commented 4 years ago

I'm down to help, but definitely not in any rush to arrive at these conclusions. I'm in a multi-month transition mindset.

And it seems like you have a much more thorough understanding of how this should be done than any rough draft I'd rig together. But up to you.

huitema commented 4 years ago

OK, I will see what I can do. I suspect that the test will reveal a couple of hot spots, because one never knows, and we will want those fixed.

huitema commented 4 years ago

@victorstewart I have a PR in progress to simulate an arbitrary number of connections on a single core. As mentioned in PR #1032, this is the first stage: verify that the simulation works. I have to do a bit of instrumentation to be able to run the test with an arbitrary number of connections, and after that some performance analysis to find the bottlenecks.

huitema commented 4 years ago

@victorstewart: The following table summarizes the analysis in PR #1032.

| NB Clients | Wall Time (s) | picoquic_reinsert_by_wake_time (% CPU) | After fix (s) |
| --- | --- | --- | --- |
| 100 | 0.137 | | 0.140 |
| 200 | 0.271 | | 0.279 |
| 500 | 0.701 | | 0.714 |
| 1000 | 1.497 | 1.92% | 1.471 |
| 2000 | 3.342 | 4.44% | 3.010 |
| 5000 | 10.463 | 9.50% | 7.467 |
| 10000 | 27.216 | 50.47% | 14.624 |
| 20000 | 93.484 | | 32.186 |
| 50000 | 571.468 | | 71.735 |

As I mentioned in my first response, "if it is not tested it probably does not work". The first tests showed that simulating 50,000 connections for 2 minutes required 9 minutes and 31 seconds of CPU. The beauty of the simulation framework is that I can use the Visual Studio test environment and measure which functions contribute to this CPU time. The culprit was obvious: the management of the "next connection to wake up" list, whose relative cost grew quite a bit as the number of simulated connections increased.

The function included a linear search. Replacing that with a splay solves the scaling issue. I can now say with confidence that yes, a single CPU thread can manage 50,000 mostly idle connections. The linear search was OK for most current users; it is actually slightly more efficient than the splay when there are fewer than 1,000 connections per server. But the difference is not large, and robust scaling is worth it.
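As a rough illustration of the fix (not the actual picoquic code, which uses its own splay implementation): keep the connections in an ordered structure keyed by next wake time, so that finding the earliest wake-up and reinserting a rescheduled connection are O(log n) instead of the O(n) scan that dominated the 50,000-connection run. The struct and comparator below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical key for the "next connection to wake" set. picoquic keeps
 * connections in a splay tree ordered this way, so the earliest wake time
 * is found and connections are reinserted in O(log n). */
typedef struct sim_cnx_t {
    uint64_t next_wake_time;
    uint64_t sequence;       /* tie-breaker so the key is unique */
} sim_cnx_t;

int compare_by_wake_time(const void *l, const void *r)
{
    const sim_cnx_t *a = (const sim_cnx_t *)l;
    const sim_cnx_t *b = (const sim_cnx_t *)r;

    if (a->next_wake_time != b->next_wake_time) {
        return (a->next_wake_time < b->next_wake_time) ? -1 : 1;
    }
    if (a->sequence != b->sequence) {
        return (a->sequence < b->sequence) ? -1 : 1;
    }
    return 0;
}
```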

I think this solves your issue. Do you agree?

victorstewart commented 4 years ago

It does! Amazing work, thanks so much Christian.