New SSL connection throttling performs even worse (with libuv)

not-implemented commented 4 years ago

While upgrading from uWebSockets 0.14, I did a benchmark with 5000 concurrent SSL (WebSocket) connections (all opened in parallel!) and noticed much (!) higher CPU usage compared to 0.14. The process was even stuck at 100% CPU for minutes instead of seconds (but the event-loop was still responsive). Without SSL it performs very well.

While researching I found the MAX_HANDSHAKES_PER_LOOP_ITERATION in src/crypto/openssl.c and the idea behind (which is nice).

But Valgrind benchmarking shows that in my case ssl_ignore_data() was called 3323616 times, and ssl_on_data() only 747 times (I didn't wait for all 5000 connections).

It seems a lot of time is wasted, iterating the event-loop and the ready handles over and over, just for ignoring most of the new connections in the queue.

When disabling this feature (I just set MAX_HANDSHAKES_PER_LOOP_ITERATION to 5000), it performs well (comparable to 0.14).

I currently don't have an idea, how the throttling of new connections could be implemented more performant - but currently it seems to be worse than without throttling under very high loads.

not-implemented commented 4 years ago

/ Maybe it would be a better approach, to throttle new connections at accept(), and do not accept all new connections at once? (https://github.com/uNetworking/uSockets/blob/master/src/loop.c#L192) /

... no, I see, that does not solve the problem.

not-implemented commented 4 years ago

Another idea:

When ignoring a readable socket (in https://github.com/uNetworking/uSockets/blob/master/src/loop.c#L228), us_poll_change() should be called and remove the LIBUS_SOCKET_READABLE flag, and the socket has to be put into a "postpone"-list.
In us_internal_loop_pre() the first 5 (or MAX_HANDSHAKES_PER_LOOP_ITERATION) sockets can be changed back with us_poll_change() to LIBUS_SOCKET_READABLE

I didn't test this, and maybe there are race-conditions ... but it seem the only way to avoid unnecessary event-loop-iterations.

ghost commented 4 years ago

I think the idea of throttling SSL handshakes is somewhat sane. A user was very pissed about major spikes in latency when performing many upgrades and this simple fix solved the lately issues.

It could probably be made more efficient though. What if you just increase it a bit? That should balance things a bit?

not-implemented commented 4 years ago

I think the idea of throttling SSL handshakes is somewhat sane.

Yes, of course!

What if you just increase it a bit?

When I set MAX_HANDSHAKES_PER_LOOP_ITERATION to 50, it recovers a little faster from 100% CPU usage (1 minute or so ... but still very slow).

I also put some debugging output in ssl_ignore_data() to see, how often the function is called inside the same us_loop_iteration_number - it starts with some hundred calls per iteration-number, and then at one point, it is called ~49000 for every iteration-number - and then it is stuck a long time at 100% CPU until it recovers. Example:

last_iteration_nr: 176, sameLoopCounter: 0
last_iteration_nr: 177, sameLoopCounter: 1
last_iteration_nr: 178, sameLoopCounter: 380
last_iteration_nr: 179, sameLoopCounter: 618
last_iteration_nr: 180, sameLoopCounter: 569
last_iteration_nr: 181, sameLoopCounter: 697
last_iteration_nr: 182, sameLoopCounter: 649
last_iteration_nr: 183, sameLoopCounter: 779
last_iteration_nr: 184, sameLoopCounter: 732
last_iteration_nr: 185, sameLoopCounter: 993
last_iteration_nr: 186, sameLoopCounter: 943
last_iteration_nr: 187, sameLoopCounter: 49099
last_iteration_nr: 188, sameLoopCounter: 49148
last_iteration_nr: 189, sameLoopCounter: 49147
last_iteration_nr: 190, sameLoopCounter: 49149
...

ghost commented 4 years ago

Those numbers don't make any sense - I think you have found a bug somewhere.

We only read a maximum of 1024 events per loop iteration, so a maximum of 1024-5 reads should be ignored. Anything crazy like 50k is broken and wrong.

Please continue to debug and see if you can figure out where things break, maybe the last_iteration_nr is not properly increased? I think the bug should be fairly simple once you find it.

not-implemented commented 4 years ago

Okay, I had a look into libuv source now:

In uv_run() in unix/core.c@381 uv__io_poll() is called after uv__run_prepare(), which increases the last_iteration_nr in our case ... fine.

Looking into uv__io_poll(): The epoll_wait() is also inside a for-loop, see unix/linux-core.c@284. And there is a check at the end at unix/linux-core.c@447, if we have 1024 (ARRAY_SIZE(events)) ready file descriptors, and if so, it continues the loop 48 times (see count init at unix/linux-core.c@281).

So last_iteration_nr is never increased in between. That exactly explains those numbers of nearly 49152 (48 * 1024) calls per loop iteration, and why it jumps from 943 calls directly to 49099 calls in my last test.

It seems, libuv is just not designed for ignoring events :-(

But what do you think about my comment above?

ghost commented 4 years ago

Wow libuv.. wtf.

Okay but just flip the counting like this -

1024/5 = 204 ignores in a row. So just have a counter set to 204 and every time you ignore a handshake you increase the counter up to 204 then you let one handshake happen then reset counter to 0 and ignore up to 204 again.

That should have the same effect, but work with any size of events per iteration.

ghost commented 4 years ago

In going to think about this. I don't like the us_poll_change solution. This counting should be a lot more forgiving if you only have connects though, it needs to take into account how many regular data events it gets

not-implemented commented 4 years ago

Hmm ... why don't you like the us_poll_change() solution?

The counting solution will never scale well under very high loads - when having really 49000 half-open SSL-connections, you have the same problem even without libuv. Libuv just showed the problem earlier.

I still suggest the following:

The ignore_data callback could be changed to a is_low_priority callback (that just returns the result of SSL_in_init() for crypto connections)
The whole counting logic could be moved outside of crypto
The decision, if a socket is throttled (low-prio-count per loop-iteration < threshold), could be centralized to us_internal_dispatch_ready_poll()
Remove LIBUS_SOCKET_READABLE for throttled sockets and put to a low-priority-queue
In us_internal_loop_post() (better than "pre" I think), the first N sockets in the queue will be reactivated ... so we get a real "FIFO" queue for free, and not a random priority
And the behaviour of Libuv is even perfect then, because the "normal" data-events are even more prioritized then
Okay we maybe need a new flag to not put the same socket into the low-priority-queue on every loop

But that would scale well under high loads I think - and does not have an impact on low loads. Or did I miss something?

ghost commented 4 years ago

I don't like it short term, too complex. Long term maybe something like that should be implemented but right now that's too scary

not-implemented commented 4 years ago

As I really like the idea of throttling SSL connections (and the current situation is a blocker for us to upgrade to uWebSockets 0.17), I implemented the queue now. See #97

It seems to work. The performance is well (~40 seconds to fully establish 5000 concurrent SSL connections in a first simple test - that's ~8ms CPU per connection, which is in any case much much better).

I just reuse the doubly-linked-list fields of the sockets and remove them from the "normal" list, which seems to be only used for timeouts - which is fine I think, as we do not need timeout-handling for throttled connections. After delaying they are reinserted into the "normal" list.

Let me know, if you like it, or if you have some better ideas :-)

ghost commented 4 years ago

Sorry for the delay; my plan is to release v18 with a few high-level features that's missing and high priority. After that 18th version runs stable, I should be able to work a lot on low level stuff again. So uSockets is quite low prio right now.

not-implemented commented 4 years ago

So you put the low-priority-queue to the low-priority-queue :-D ;-)

No hurry ... we still run the patched version in production.

ghost commented 4 years ago

Ba Dum Tish!

Do you have any stability issues overall? Like, you run servers for weeks without issues right?

not-implemented commented 4 years ago

No crashes or other stability issues since 4 weeks :-)

(One instance had a little leaky memory usage for one day, but this may also be an issue in application code and was not reproducable since then)

ghost commented 4 years ago

23 days ago, wtf I missed this response.

Great, thanks, so you've successfully moved from v0.14 to v17 and lived to tell the tale!

not-implemented commented 4 years ago

So ... no stability issues since > 7 weeks now :-)

ghost commented 4 years ago

The counting solution will never scale well under very high loads

Benchmarks please

which is fine I think, as we do not need timeout-handling for throttled connections

They need timeouts, otherwise you'll have memory leaks since they can get stuck

so we get a real "FIFO" queue for free, and not a random priority

It's not supposed to be fifo. The kernel has no guarantees for ordering

I would like to see numbers for counting solution

ghost commented 4 years ago

https://stackoverflow.com/questions/19114001/does-epoll-preserve-the-order-in-which-fds-was-registered/19114553#19114553

Apparently they are indeed fifo

ghost commented 4 years ago

If the time sweep goes over both low priority list and regular list, ticking, you solve the potential memory leaks

not-implemented commented 4 years ago

The counting solution will never scale well under very high loads

Benchmarks please

Ok, I'll prepare some more detailled benchmarks shortly.

which is fine I think, as we do not need timeout-handling for throttled connections

They need timeouts, otherwise you'll have memory leaks since they can get stuck

While being in the low-priority-queue, the responsibility to handle the socket is still on our side, not on client side. We are responsible to put the socket back to normal processing, and handle the incoming SSL-Handshake, when the CPU usage allows it. After the socket has been put back, it also follows the normal timeout handling. The low-priority-queue should always be empty after handling the incoming connection peaks (at least if the current implementation has no error). So there should not be any memory leaks, regardless of clients behaviour. The idea was, to "extend" the timeouts in overload-scenarios by removing it from the "normal" list. And the main reason, of course, we save two additional pointers per socket for a second doubly-linked-list. But of course, a second list will also be possible, but I still think, there is no need for it.

so we get a real "FIFO" queue for free, and not a random priority

It's not supposed to be fifo. The kernel has no guarantees for ordering

Ah okay ... but see https://github.com/uNetworking/uSockets/pull/97#issuecomment-620672702 ... over the time I changed the low-priority to a LIFO queue anyway. See my comments in #97 for an explanation (I also noted it in the code comments).

I would like to see numbers for counting solution

As I said ... I'll prepare benachmarks :-)

ghost commented 4 years ago

We are responsible to put the socket back to normal processing, and handle the incoming SSL-Handshake, when the CPU usage allows it

Right, but what if you only got a small portion of the needed data, say 1 byte? Or what if finishing the handshake requires writing then reading more?

Does it go back to timeout even in those cases?

not-implemented commented 4 years ago

Yes of course, maybe this was a small misunderstanding. Just have a look at the code:

The "lifetime" of the low-priority-queue ist just from the "LIBUS_SOCKET_READABLE" event ...

https://github.com/uNetworking/uSockets/pull/97/files#diff-3ca823489e2803bdd90f0774bbdd0c09R267

... until we have time/budget to handle this one "LIBUS_SOCKET_READABLE"-event ...

https://github.com/uNetworking/uSockets/pull/97/files#diff-3ca823489e2803bdd90f0774bbdd0c09R118

... and is removed from the low-priority-queue immediately before we handle it (and maybe reinserted on the next "LIBUS_SOCKET_READABLE"-event ...). It is NOT in the low-priority-queue the whole handshake. So the change is as unintrusive as possible.

not-implemented commented 4 years ago

The idea was "in case of overload, just tell epoll() to not inform us about the readability for now" ... and to not forget to tell epoll() when we are interested in the readability again, we need the low-priority-queue. And this happens in us_internal_handle_low_priority_sockets() which is now called from us_internal_loop_pre() on every event loop.

not-implemented commented 4 years ago

@alexhultman Okay, I did some more detailled benchmarking now - Setup:

I used your "examples/HelloWorld.cpp" from uWebSockets (with a proper self signed cert) (commit c7506a0)
Building: WITH_OPENSSL=1 make examples
ulimit -n 20000
./HelloWorld
I used "wrk" for benchmarking:
ulimit -n 20000
wrk -c 10000 https://localhost:3000/hello

So we have up to 10000 parallel incoming SSL connections. Note that the benchmark is even without libuv, so we even do not have the "48x"-problem here (see https://github.com/uNetworking/uSockets/issues/96#issuecomment-615596193).

Benchmark results (both on the same machine):

current master (uWebSockets commit c7506a0 - with corresponding uSockets version): Requests/sec: 35.70 (and socket errors)
with low-priority-queue patch: Requests/sec: 8117.43 (no socket errors)

Full result (current master):

$ ./wrk -c 10000 https://localhost:3000/hello
Running 10s test @ https://localhost:3000/hello
  2 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   359.60ms  311.63ms   1.91s    86.49%
    Req/Sec    37.92     60.74   271.00     90.00%
  359 requests in 10.05s, 23.84KB read
  Socket errors: connect 748, read 133, write 0, timeout 26
Requests/sec:     35.70
Transfer/sec:      2.37KB

Full result (current master with low-priority-queue patch):

$ ./wrk -c 10000 https://localhost:3000/hello
Running 10s test @ https://localhost:3000/hello
  2 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    57.13ms   17.17ms 206.04ms   94.08%
    Req/Sec     4.57k     2.20k    9.14k    62.42%
  81793 requests in 10.08s, 5.30MB read
Requests/sec:   8117.43
Transfer/sec:    539.05KB

uNetworking / uSockets

New SSL connection throttling performs even worse (with libuv) #96