Maybe it would be a better approach to throttle new connections at accept() and not accept all new connections at once? (https://github.com/uNetworking/uSockets/blob/master/src/loop.c#L192)
... no, I see, that does not solve the problem.
Another idea:
- us_poll_change() should be called to remove the LIBUS_SOCKET_READABLE flag, and the socket has to be put into a "postpone" list.
- In us_internal_loop_pre(), the first 5 (or MAX_HANDSHAKES_PER_LOOP_ITERATION) sockets can be changed back to LIBUS_SOCKET_READABLE with us_poll_change().
I didn't test this, and maybe there are race conditions ... but it seems to be the only way to avoid unnecessary event-loop iterations.
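A minimal sketch of that idea, assuming the real uSockets calls us_poll_change(), us_poll_events() and the LIBUS_SOCKET_READABLE flag; the postpone_push()/postpone_pop() helpers and the reuse of the budget constant here are hypothetical, not the actual implementation:

```c
#include "libusockets.h"

/* Hypothetical postpone-list helpers (not part of uSockets). */
void postpone_push(struct us_loop_t *loop, struct us_socket_t *s);
struct us_socket_t *postpone_pop(struct us_loop_t *loop);

#define MAX_HANDSHAKES_PER_LOOP_ITERATION 5 /* same constant as in crypto/openssl.c */

/* When a handshake must be throttled: stop polling for readability and remember
 * the socket, instead of letting epoll report it again on every iteration. */
static void postpone_handshake(struct us_loop_t *loop, struct us_socket_t *s) {
    struct us_poll_t *p = (struct us_poll_t *) s;
    us_poll_change(p, loop, us_poll_events(p) & ~LIBUS_SOCKET_READABLE);
    postpone_push(loop, s);
}

/* Called from us_internal_loop_pre(): re-arm a limited number of postponed sockets. */
static void resume_postponed(struct us_loop_t *loop) {
    for (int i = 0; i < MAX_HANDSHAKES_PER_LOOP_ITERATION; i++) {
        struct us_socket_t *s = postpone_pop(loop);
        if (!s) break;
        struct us_poll_t *p = (struct us_poll_t *) s;
        us_poll_change(p, loop, us_poll_events(p) | LIBUS_SOCKET_READABLE);
    }
}
```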
I think the idea of throttling SSL handshakes is somewhat sane. A user was very pissed about major spikes in latency when performing many upgrades, and this simple fix solved those latency issues.
It could probably be made more efficient, though. What if you just increase it a bit? That should balance things a bit?
I think the idea of throttling SSL handshakes is somewhat sane.
Yes, of course!
What if you just increase it a bit?
When I set MAX_HANDSHAKES_PER_LOOP_ITERATION to 50, it recovers a little faster from 100% CPU usage (1 minute or so ... but still very slow).
I also put some debugging output into ssl_ignore_data() to see how often the function is called within the same us_loop_iteration_number - it starts with a few hundred calls per iteration number, then at one point it is called ~49000 times per iteration number - and then it is stuck at 100% CPU for a long time until it recovers. Example:
last_iteration_nr: 176, sameLoopCounter: 0
last_iteration_nr: 177, sameLoopCounter: 1
last_iteration_nr: 178, sameLoopCounter: 380
last_iteration_nr: 179, sameLoopCounter: 618
last_iteration_nr: 180, sameLoopCounter: 569
last_iteration_nr: 181, sameLoopCounter: 697
last_iteration_nr: 182, sameLoopCounter: 649
last_iteration_nr: 183, sameLoopCounter: 779
last_iteration_nr: 184, sameLoopCounter: 732
last_iteration_nr: 185, sameLoopCounter: 993
last_iteration_nr: 186, sameLoopCounter: 943
last_iteration_nr: 187, sameLoopCounter: 49099
last_iteration_nr: 188, sameLoopCounter: 49148
last_iteration_nr: 189, sameLoopCounter: 49147
last_iteration_nr: 190, sameLoopCounter: 49149
...
Those numbers don't make any sense - I think you have found a bug somewhere.
We only read a maximum of 1024 events per loop iteration, so a maximum of 1024-5 reads should be ignored. Anything crazy like 50k is broken and wrong.
Please continue to debug and see if you can figure out where things break, maybe the last_iteration_nr is not properly increased? I think the bug should be fairly simple once you find it.
Okay, I had a look into the libuv source now:
In uv_run() in unix/core.c@381, uv__io_poll() is called after uv__run_prepare(), which increases last_iteration_nr in our case ... fine.
Looking into uv__io_poll(): the epoll_wait() is also inside a for-loop, see unix/linux-core.c@284. And there is a check at the end, at unix/linux-core.c@447: if we have 1024 (ARRAY_SIZE(events)) ready file descriptors, it continues the loop up to 48 times (see the count init at unix/linux-core.c@281).
So last_iteration_nr is never increased in between. That exactly explains those numbers of nearly 49152 (48 * 1024) calls per loop iteration, and why it jumps from 943 calls directly to 49099 calls in my last test.
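For illustration, the pattern described above looks roughly like this - a simplified paraphrase under the assumptions above, not the actual libuv source:

```c
#include <sys/epoll.h>

#define EVENTS_PER_POLL 1024 /* ARRAY_SIZE(events) in unix/linux-core.c */

/* Simplified paraphrase: when all 1024 event slots were filled, uv__io_poll()
 * assumes more events are pending and polls again - up to 48 times - without
 * returning to uv_run(), so prepare handles (and with them last_iteration_nr)
 * do not run in between. */
static void io_poll_sketch(int epoll_fd, int timeout) {
    struct epoll_event events[EVENTS_PER_POLL];
    int count = 48; /* retry budget, as described above */

    for (;;) {
        int nfds = epoll_wait(epoll_fd, events, EVENTS_PER_POLL, timeout);

        /* ... dispatch all nfds events here; this is where ignored handshakes
         *     pile up: up to 48 * 1024 = 49152 dispatches per loop iteration ... */

        if (nfds == EVENTS_PER_POLL && --count != 0) {
            timeout = 0; /* poll again immediately, prepare handles do NOT run */
            continue;
        }
        return; /* back to uv_run(), the iteration number advances */
    }
}
```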
It seems libuv is just not designed for ignoring events :-(
But what do you think about my comment above?
Wow libuv.. wtf.
Okay, but just flip the counting like this:
1024/5 = 204 ignores in a row. So just have a counter, and every time you ignore a handshake you increase the counter up to 204; then you let one handshake happen, reset the counter to 0, and ignore up to 204 again.
That should have the same effect, but work with any size of events per iteration.
I'm going to think about this. I don't like the us_poll_change solution. This counting should be a lot more forgiving if you only have connects, though - it needs to take into account how many regular data events it gets.
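A minimal sketch of that counting idea, assuming it would sit where ssl_ignore_data() currently decides to throttle; the names MAX_IGNORES_IN_A_ROW and should_ignore_handshake() are hypothetical:

```c
/* Ratio-based throttle: instead of allowing N handshakes per loop iteration,
 * allow one handshake per MAX_IGNORES_IN_A_ROW ignored ones, independent of how
 * many events the backend delivers per iteration. */
#define MAX_IGNORES_IN_A_ROW 204 /* roughly 1024 / 5 */

static int ignored_in_a_row = 0; /* per-loop state in a real implementation */

static int should_ignore_handshake(void) {
    if (ignored_in_a_row < MAX_IGNORES_IN_A_ROW) {
        ignored_in_a_row++;
        return 1; /* throttle: ignore this handshake for now */
    }
    ignored_in_a_row = 0; /* let one handshake through, then start over */
    return 0;
}
```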
Hmm ... why don't you like the us_poll_change() solution?
The counting solution will never scale well under very high loads - when you really have 49000 half-open SSL connections, you have the same problem even without libuv. libuv just showed the problem earlier.
I still suggest the following:
1. The ignore_data callback could be changed to an is_low_priority callback (that just returns the result of SSL_in_init() for crypto connections).
2. In us_internal_dispatch_ready_poll(), remove LIBUS_SOCKET_READABLE for throttled sockets and put them into a low-priority queue.
3. In us_internal_loop_post() (better than "pre", I think), the first N sockets in the queue are reactivated ... so we get a real "FIFO" queue for free, and not a random priority.
But that would scale well under high loads, I think - and it does not have an impact on low loads. Or did I miss something?
I don't like it short term, too complex. Long term maybe something like that should be implemented but right now that's too scary
As I really like the idea of throttling SSL connections (and the current situation is a blocker for us to upgrade to uWebSockets 0.17), I have implemented the queue now. See #97
It seems to work. The performance is good (~40 seconds to fully establish 5000 concurrent SSL connections in a first simple test - that's ~8ms CPU per connection, which is in any case much, much better).
I just reuse the doubly-linked-list fields of the sockets and remove them from the "normal" list, which seems to be used only for timeouts - which is fine I think, as we do not need timeout handling for throttled connections. After the delay, they are reinserted into the "normal" list.
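For illustration, the pointer reuse described above boils down to standard intrusive-list moves; the struct below is a made-up stand-in, not the real us_socket_t layout:

```c
/* A socket can live in exactly one intrusive list at a time - either the "normal"
 * (timeout) list or the low-priority list - using the same prev/next fields. */
struct sock {
    struct sock *prev, *next;
    /* ... other socket fields ... */
};

/* Unlink from whatever list the socket is currently in. */
static void list_unlink(struct sock **head, struct sock *s) {
    if (s->prev) s->prev->next = s->next; else *head = s->next;
    if (s->next) s->next->prev = s->prev;
    s->prev = s->next = 0;
}

/* Push to the front of another list, reusing the same pointer fields. */
static void list_push(struct sock **head, struct sock *s) {
    s->prev = 0;
    s->next = *head;
    if (*head) (*head)->prev = s;
    *head = s;
}
```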
Let me know if you like it, or if you have some better ideas :-)
Sorry for the delay; my plan is to release v18 with a few high-level features that are missing and high priority. Once that 18th version runs stable, I should be able to work a lot on low-level stuff again. So uSockets is quite low prio right now.
So you put the low-priority queue into the low-priority queue :-D ;-)
No hurry ... we still run the patched version in production.
Ba Dum Tish!
Do you have any stability issues overall? Like, you run servers for weeks without issues, right?
No crashes or other stability issues for 4 weeks :-)
(One instance had slightly leaky memory usage for one day, but this may also be an issue in application code and has not been reproducible since then)
Wtf, I missed this response from 23 days ago.
Great, thanks, so you've successfully moved from v0.14 to v17 and lived to tell the tale!
So ... no stability issues for > 7 weeks now :-)
The counting solution will never scale well under very high loads
Benchmarks please
which is fine I think, as we do not need timeout-handling for throttled connections
They need timeouts, otherwise you'll have memory leaks since they can get stuck
so we get a real "FIFO" queue for free, and not a random priority
It's not supposed to be fifo. The kernel has no guarantees for ordering
I would like to see numbers for counting solution
Apparently they are indeed fifo
If the timeout sweep goes over both the low-priority list and the regular list, ticking, you solve the potential memory leaks
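A sketch of what that could look like; the struct and function names here are made up for illustration and do not mirror the real uSockets timer sweep:

```c
/* Tick timeouts on the regular list and the low-priority list in the same sweep,
 * so a socket parked in the low-priority queue can still time out and be closed
 * instead of leaking if the client never completes the handshake. */
struct sweep_socket {
    struct sweep_socket *next;
    unsigned short timeout; /* remaining ticks, 0 = no timeout armed */
};

static void sweep_list(struct sweep_socket *head) {
    for (struct sweep_socket *s = head; s; ) {
        struct sweep_socket *next = s->next; /* s may be closed/unlinked below */
        if (s->timeout && --s->timeout == 0) {
            /* timeout callback: close the socket / remove it from its list */
        }
        s = next;
    }
}

static void sweep_all(struct sweep_socket *regular_head, struct sweep_socket *low_prio_head) {
    sweep_list(regular_head);
    sweep_list(low_prio_head); /* the point made above: sweep both lists */
}
```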
The counting solution will never scale well under very high loads
Benchmarks please
Ok, I'll prepare some more detailed benchmarks shortly.
which is fine I think, as we do not need timeout-handling for throttled connections
They need timeouts, otherwise you'll have memory leaks since they can get stuck
While a socket is in the low-priority queue, the responsibility for handling it is still on our side, not on the client side. We are responsible for putting the socket back into normal processing and handling the incoming SSL handshake when the CPU usage allows it. After the socket has been put back, it also follows the normal timeout handling. The low-priority queue should always be empty once the incoming connection peaks have been handled (at least if the current implementation has no error). So there should not be any memory leaks, regardless of client behaviour. The idea was to "extend" the timeouts in overload scenarios by removing the socket from the "normal" list. And the main reason, of course: we save two additional pointers per socket for a second doubly-linked list. A second list would also be possible, but I still think there is no need for it.
so we get a real "FIFO" queue for free, and not a random priority
It's not supposed to be fifo. The kernel has no guarantees for ordering
Ah okay ... but see https://github.com/uNetworking/uSockets/pull/97#issuecomment-620672702 ... over time I changed the low-priority queue to a LIFO queue anyway. See my comments in #97 for an explanation (I also noted it in the code comments).
I would like to see numbers for counting solution
As I said ... I'll prepare benchmarks :-)
We are responsible to put the socket back to normal processing, and handle the incoming SSL-Handshake, when the CPU usage allows it
Right, but what if you only got a small portion of the needed data, say 1 byte? Or what if finishing the handshake requires writing then reading more?
Does it go back to timeout even in those cases?
Yes, of course - maybe this was a small misunderstanding. Just have a look at the code:
The "lifetime" of an entry in the low-priority queue is just from the "LIBUS_SOCKET_READABLE" event ...
https://github.com/uNetworking/uSockets/pull/97/files#diff-3ca823489e2803bdd90f0774bbdd0c09R267
... until we have time/budget to handle this one "LIBUS_SOCKET_READABLE" event ...
https://github.com/uNetworking/uSockets/pull/97/files#diff-3ca823489e2803bdd90f0774bbdd0c09R118
... and the socket is removed from the low-priority queue immediately before we handle it (and maybe reinserted on the next "LIBUS_SOCKET_READABLE" event ...). It is NOT in the low-priority queue for the whole handshake. So the change is as unintrusive as possible.
The idea was: "in case of overload, just tell epoll() not to inform us about readability for now" ... and so that we do not forget to tell epoll() when we are interested in readability again, we need the low-priority queue. This happens in us_internal_handle_low_priority_sockets(), which is now called from us_internal_loop_pre() on every event-loop iteration.
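A condensed sketch of that mechanism, assuming the public us_poll_change()/us_poll_events() calls and the intrusive prev/next links from uSockets' internal headers; the low_prio_head parameter and the budget value are illustrative and may differ from the actual PR:

```c
/* Called from us_internal_loop_pre() on every iteration: spend a small budget
 * re-arming parked handshake sockets so epoll reports them again; everything
 * beyond the budget simply stays parked until a later iteration. */
#define LOW_PRIO_BUDGET 32 /* illustrative value */

static void handle_low_priority_sockets_sketch(struct us_loop_t *loop,
                                               struct us_socket_t **low_prio_head) {
    for (int budget = LOW_PRIO_BUDGET; budget && *low_prio_head; budget--) {
        struct us_socket_t *s = *low_prio_head;

        /* pop from the low-priority queue */
        *low_prio_head = s->next;
        if (*low_prio_head) (*low_prio_head)->prev = 0;
        s->prev = s->next = 0;

        /* ... re-insert into the "normal" list / timeout handling here ... */

        /* tell epoll we are interested in readability again */
        struct us_poll_t *p = (struct us_poll_t *) s;
        us_poll_change(p, loop, us_poll_events(p) | LIBUS_SOCKET_READABLE);
    }
}
```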
@alexhultman Okay, I did some more detailed benchmarking now - setup:
I used your "examples/HelloWorld.cpp" from uWebSockets (with a proper self-signed cert) (commit c7506a0)
Building: WITH_OPENSSL=1 make examples
ulimit -n 20000
./HelloWorld
I used "wrk" for benchmarking:
ulimit -n 20000
wrk -c 10000 https://localhost:3000/hello
So we have up to 10000 parallel incoming SSL connections. Note that the benchmark is even without libuv, so we do not even have the "48x" problem here (see https://github.com/uNetworking/uSockets/issues/96#issuecomment-615596193).
Benchmark results (both on the same machine):
Full result (current master):
$ ./wrk -c 10000 https://localhost:3000/hello
Running 10s test @ https://localhost:3000/hello
2 threads and 10000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 359.60ms 311.63ms 1.91s 86.49%
Req/Sec 37.92 60.74 271.00 90.00%
359 requests in 10.05s, 23.84KB read
Socket errors: connect 748, read 133, write 0, timeout 26
Requests/sec: 35.70
Transfer/sec: 2.37KB
Full result (current master with low-priority-queue patch):
$ ./wrk -c 10000 https://localhost:3000/hello
Running 10s test @ https://localhost:3000/hello
2 threads and 10000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 57.13ms 17.17ms 206.04ms 94.08%
Req/Sec 4.57k 2.20k 9.14k 62.42%
81793 requests in 10.08s, 5.30MB read
Requests/sec: 8117.43
Transfer/sec: 539.05KB
While upgrading from uWebSockets 0.14, I did a benchmark with 5000 concurrent SSL (WebSocket) connections (all opened in parallel!) and noticed much (!) higher CPU usage compared to 0.14. The process was even stuck at 100% CPU for minutes instead of seconds (but the event loop was still responsive). Without SSL it performs very well.
While researching, I found MAX_HANDSHAKES_PER_LOOP_ITERATION in src/crypto/openssl.c and the idea behind it (which is nice). But profiling with Valgrind shows that in my case ssl_ignore_data() was called 3323616 times, and ssl_on_data() only 747 times (I didn't wait for all 5000 connections). It seems a lot of time is wasted iterating the event loop and the ready handles over and over, just to ignore most of the new connections in the queue.
When disabling this feature (I just set MAX_HANDSHAKES_PER_LOOP_ITERATION to 5000), it performs well (comparable to 0.14). I currently don't have an idea how the throttling of new connections could be implemented in a more performant way - but currently it seems to be worse than no throttling under very high loads.