sozu-proxy / sozu

Sōzu HTTP reverse proxy, configurable at runtime, fast and safe, built in Rust. It is awesome!
https://www.sozu.io/
GNU Affero General Public License v3.0

Reduce syscalls for performance #1025

Closed Keksoj closed 1 year ago

Keksoj commented 1 year ago

This is a joint effort with @Wonshtrum to reduce the number of syscalls in Sōzu version 0.15. When benchmarking version 0.15 against version 0.13.6, we found a disproportionate number of getrandom and writev syscalls, especially for HTTPS traffic.

writev

Here is the strace of how a simple response is sent to the client in a TLS-encrypted form, in Sōzu 0.13:

writev(8, [{iov_base="\27\3\3\0[\233\\@IW\323\371o\255y`Z\376b7\362vA\321\207Z\253S\361\177\234\220"..., iov_len=96}], 1) = 96
writev(8, [{iov_base="\27\3\3\0006N\236\375\25\6\233\350\363:\227XUx\376\256\25\337~\27\21\fI\202\2452\236,"..., iov_len=59}], 1) = 59
writev(8, [{iov_base="\27\3\3\0\35|N\346\376k\231\360\34\377\f\27\364\372\6;\373\6)\243\343\260'.r\340\340\342"..., iov_len=34}], 1) = 34

The response is split by a custom BufferQueue into three writes.

Here is the same traffic sent to the client in version 0.15.14. There are more headers than in version 0.13, and for each of them, Sōzu performs a writev call:

writev(11, [{iov_base="\27\3\3\0\31\201\357\330\261j\24\233\356j\317L\253\366\301\vm;h\220m\334\253w-U", iov_len=30}], 1) = 30
writev(11, [{iov_base="\27\3\3\0\22\275#\6(\260\237:\352\23\241d\263S_\241\351\304\300", iov_len=23}], 1) = 23
writev(11, [{iov_base="\27\3\3\0\24\270\235\334 \241\221\340\230\353p\16\360-\325\376r>\262\7V", iov_len=25}], 1) = 25
writev(11, [{iov_base="\27\3\3\0\22\16\235\323\334\316\231o\3\324^WM\242\373N\270\f\220", iov_len=23}], 1) = 23
writev(11, [{iov_base="\27\3\3\0\23\240\307\311\24Z\353\277\v\322!T\254\267\342\27-\350\246\271", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0\23P>\250{k\35\273\21!dZr\230f5\323\275\227\344", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0\37\250B_%XY&+\250\302.\26\352\n\357\1\313\32vI\301\247\206\304\364h$"..., iov_len=36}], 1) = 36
writev(11, [{iov_base="\27\3\3\0\23\5g9\230\306\210k\336\374@\352\311\253?i\330\255:=", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0\23\302\21\242\354\37\10\203\363\3\10\202#2\317\\:\t\220\252", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0\23 nt!+E\213\301\337K\1\256\\M,\32\265\274\226", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0\258d\314u\244\305\311>F\361\27?\16Z\367\5I\330?yq", iov_len=26}], 1) = 26
writev(11, [{iov_base="\27\3\3\0\23\0\263H\300.U\347H\333}9\320B]\233S\356\371/", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0.:\24\246:c\242G\227s\2358Z-5\200.\362#\16\327uX\310D.)\332"..., iov_len=51}], 1) = 51
writev(11, [{iov_base="\27\3\3\0\23\211`\211H\227L\203im\205my\343\354\211\245\263\327\221", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0\30\271\255\337\fG*\231\3263\374\306\244R1\215\204\365\266,j\212\235~\266", iov_len=29}], 1) = 29
writev(11, [{iov_base="\27\3\3\0\23\274;R\220\244I,\217n\215_\354p!gP\375\334\\", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0+r\340\26yL,\5\17u\223\230\362\331\230\257Z\33MFDck\37!\302\307\376"..., iov_len=48}], 1) = 48
writev(11, [{iov_base="\27\3\3\0\23\204f\276\240\216\340\32\257U\266l\23:\307oZRR\35", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0\23\362\323\261\355pA\2625[p\317\5\31m\32&\17\230\355", iov_len=24}], 1) = 24
writev(11, [{iov_base="\27\3\3\0\33\237G\307}}\300\227M\240\203\220\212#\f\334\16\315\342\330\253\344\277%\227u\214`", iov_len=32}], 1) = 32

This comes from the way Kawa stores data, which is passed to Rustls as is. Fortunately, the Rustls API offers a write_vectored method on its Writer, which lets all these writes go out in a single syscall, so we switched to it. The same response now produces a single writev:

writev(
    11,
    [
        {iov_base="\27\3\3\0\31\210H\371\252\25w\304\275\346u\27\371\334g\345\244\371\331\203c\302\356\324i\367", iov_len=30},
        {iov_base="\27\3\3\0\22\366IR\236\357\374\37\30\310E\330Xr\3249\24r0", iov_len=23},
        {iov_base="\27\3\3\0\24\326\337\223\356R\360\37\215\343a\2L\236\30\24D\31\10\363\16", iov_len=25},
        {iov_base="\27\3\3\0\22\347\305\255\371?\224\6\212\325\345\250*a\250\321i\366\266", iov_len=23},
        {iov_base="\27\3\3\0\23\37\24Ex\334\vHp\2758\207\27G\306\16\226\253\320\240", iov_len=24},
        {iov_base="\27\3\3\0\23\23\362q\352\307\303\364\312]\361(\227<\17\334Y\333g\310", iov_len=24},
        {iov_base="\27\3\3\0\37\216\332\374\376\303H\206\35f\334\310\252\375\341\366\224\261*\367\30mg\\\230dbh"..., iov_len=36},
        {iov_base="\27\3\3\0\23-\312\31\v\353\212\303\265\231H\371\356%\317JX\32Tg", iov_len=24},
        {iov_base="\27\3\3\0\23d\356\0276\3236\252\206^5\346=\234F\2\200\33\314>", iov_len=24},
        {iov_base="\27\3\3\0\23U\30\37x7\5d\304/ZY\274\25\17;\276t]\216", iov_len=24},
        {iov_base="\27\3\3\0\25\240\371q\2416Z\202\6\35\311|\203Bai-\20\217=\3516", iov_len=26},
        {iov_base="\27\3\3\0\23\220\350\323\321bx\321\21:\35\203?\257\313)\200\364E\377", iov_len=24},
        {iov_base="\27\3\3\0.\226\254\30\240\315\307o\250\243b`Q\n\17\226\333\347\256[\331\324#~/\240\206,"..., iov_len=51},
        {iov_base="\27\3\3\0\23\251\207:\350?/\252Z\317\322\0017\24E\256\5*\342\224", iov_len=24},
        {iov_base="\27\3\3\0\30$w0!\205\310\276\372k\204?\375k\334\334\363\314\\!\355\266\0\362\33", iov_len=29},
        {iov_base="\27\3\3\0\23\257\277rj;\3\376\251\10\310\37\265\365\303\216\260\345\361\366", iov_len=24},
        {iov_base="\27\3\3\0+\341\0313\277'\223\3\272W&\316\306\360BYI9\225Vh\347t\3}\303\321\343"..., iov_len=48},
        {iov_base="\27\3\3\0\23\247\33\26M\255%\351\214F\254\177\250+z\253]\17F\325", iov_len=24},
        {iov_base="\27\3\3\0\23fm\262t\334\271>\226\225ITia\250,\31^\37\f", iov_len=24},
        {iov_base="\27\3\3\0\33S\25/zM'\343\302L/l\\\1\22\250M\0AL\357\304\36\227\317\322Q\365", iov_len=32}
    ],
    20
) = 563

This improves performance significantly, as the benchmarks below show.
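To illustrate the idea outside of Sōzu, here is a minimal sketch using only the standard library. It is not the code of this pull request: `CountingSink`, `send_one_by_one` and `send_vectored` are hypothetical names, and the sink merely counts write calls where a real socket would issue syscalls.

```rust
use std::io::{self, IoSlice, Write};

/// Hypothetical stand-in for a socket: records how many write calls it
/// receives, where a real socket would issue one syscall per call.
struct CountingSink {
    calls: usize,
    bytes: Vec<u8>,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.calls += 1;
        self.bytes.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn write_vectored(&mut self, bufs: &[IoSlice<'_>]) -> io::Result<usize> {
        // One call covers every buffer, like a single writev with N iovecs.
        self.calls += 1;
        let mut written = 0;
        for buf in bufs {
            self.bytes.extend_from_slice(buf);
            written += buf.len();
        }
        Ok(written)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// One write per block: the pattern seen in the 0.15.14 trace.
fn send_one_by_one<W: Write>(writer: &mut W, blocks: &[&[u8]]) -> io::Result<()> {
    for block in blocks {
        writer.write_all(block)?;
    }
    Ok(())
}

/// All blocks handed over in one vectored write.
/// Returns the number of bytes accepted, which may be less than the total
/// on a real socket, so production code has to handle partial writes.
fn send_vectored<W: Write>(writer: &mut W, blocks: &[&[u8]]) -> io::Result<usize> {
    let slices: Vec<IoSlice<'_>> = blocks.iter().map(|b| IoSlice::new(b)).collect();
    writer.write_vectored(&slices)
}
```

Sending the same list of blocks through both functions yields identical bytes, but the sink sees one call per block in the first case and a single call in the second.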

Performance improvement

Sōzu 0.13.6

bombardier -c 200 -n 100000 https://localhost:8443/api -l
Bombarding https://localhost:8443/api with 100000 request(s) using 200 connection(s)
 100000 / 100000 [=====================================================================================================] 100.00% 7784/s 12s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec      7837.35    1514.08   14001.32
  Latency       25.58ms    18.76ms   458.92ms
  Latency Distribution
     50%    24.37ms
     75%    25.99ms
     90%    28.25ms
     95%    29.65ms
     99%    32.85ms
  HTTP codes:
    1xx - 0, 2xx - 100000, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     2.10MB/s

Sōzu 0.15.14 without fixes

bombardier -c 200 -n 100000 https://localhost:8443/api -l
Bombarding https://localhost:8443/api with 100000 request(s) using 200 connection(s)
 100000 / 100000 [=====================================================================================================] 100.00% 3193/s 31s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec      3200.75     737.90    5135.83
  Latency       62.43ms    26.44ms   667.69ms
  Latency Distribution
     50%    60.10ms
     75%    63.60ms
     90%    68.29ms
     95%    71.33ms
     99%    87.74ms
  HTTP codes:
    1xx - 0, 2xx - 100000, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     2.00MB/s

Sōzu 0.15.14 with both fixes of this pull request

bombardier -c 200 -n 100000 https://localhost:8443/api -l
Bombarding https://localhost:8443/api with 100000 request(s) using 200 connection(s)
 100000 / 100000 [=====================================================================================================] 100.00% 7661/s 13s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec      7785.59    1285.38   13644.94
  Latency       25.73ms    25.89ms   618.15ms
  Latency Distribution
     50%    24.09ms
     75%    26.02ms
     90%    28.26ms
     95%    29.76ms
     99%    34.26ms
  HTTP codes:
    1xx - 0, 2xx - 100000, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     4.86MB/s

As you can see, the fixes bring Sōzu's performance (in this somewhat limited use case) back to 0.13.6 levels: about 7,800 requests per second, against about 3,200 without the fixes. Version 0.15.14 still struggles compared with 0.13.6 when confronted with many concurrent TLS handshakes, but we're getting there. Note that the throughput has increased; that's because headers are more numerous since the introduction of Kawa.

getrandom, it's all about TLS 1.3 resumption tickets

We found that Sōzu 0.15.14 issues four times as many getrandom syscalls during a TLS handshake as version 0.13.6 (12 instead of 3):

// 0.13.6
getrandom("\x03\x81\x50\x75....", 32, 0) = 32
getrandom("\x49\x7d\x38\x98....", 32, 0) = 32
getrandom("\xd3\xde\xb1\x6e", 4, 0)     = 4
// 0.15.14
getrandom("\x73\x30\x04\x5b...", 32, 0) = 32
getrandom("\xc6\xb6\xe5\x7b", 4, 0)     = 4
getrandom("\x4f\xe0\xd1\xe4...", 32, 0) = 32
getrandom("\x13\x7e\xa1\xf5...", 32, 0) = 32
getrandom("\x3c\x5d\x3c\xc9", 4, 0)     = 4
getrandom("\x71\xd1\x73\x76...", 32, 0) = 32
getrandom("\x97\x18\x1a\xb5...", 32, 0) = 32
getrandom("\x7e\x92\xbf\x14", 4, 0)     = 4
getrandom("\xaf\x1f\xb8\x53...", 32, 0) = 32
getrandom("\x20\xef\x36\xad...", 32, 0) = 32
getrandom("\xcd\xe0\x96\xd0", 4, 0)     = 4
getrandom("\x33\xd0\x99\x85...", 32, 0) = 32

After a lot of digging, we found that Rustls 0.19, used in Sōzu 0.13.6, seems to produce one TLS 1.3 ticket (used by a client to resume a TLS session). Producing a ticket needs 3 getrandom syscalls, as far as we understand.

Sōzu 0.15.14 does 4 times as many getrandom calls at this step of the TLS handshake. That is because Rustls 0.21.8, used in Sōzu 0.15, produces four TLS 1.3 tickets by default. This default comes from this commit in this PR, meant to resolve this issue. I quote the issue, since it seems relevant:

RFC8446 section 4.6.1 recommends that TLS 1.3 servers send multiple session resumption tickets to clients. In appendix C.4, it's subsequently recommended that clients use tickets at most once to avoid session tracking. The current implementation of ClientSessionMemoryCache does not do this, and some properties of StoresClientSession (and use of the cache in general) make doing so difficult.
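As a quick sanity check of this explanation against the two traces above, here is a tiny sketch. The cost of 3 getrandom calls per ticket is our understanding stated earlier, not something Rustls documents, and the names are made up for the illustration.

```rust
/// Assumption from the discussion above: producing one TLS 1.3 ticket
/// costs 3 getrandom syscalls. This is an observation, not a documented fact.
const GETRANDOM_CALLS_PER_TICKET: usize = 3;

/// Hypothetical helper: getrandom calls expected at the ticket step of a
/// handshake, for a given number of tickets sent by the server.
fn getrandom_calls_for_tickets(tickets: usize) -> usize {
    tickets * GETRANDOM_CALLS_PER_TICKET
}
```

With one ticket this gives the 3 calls seen in the 0.13.6 trace, and with four tickets the 12 calls seen in the 0.15.14 trace.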

The number of tickets is accessible in this public field of the Rustls configuration, and it does default to 4. Resetting it to 1 may improve the performance of Sōzu's TLS handshake under intense traffic with a lot of simultaneous TLS handshakes.

Producing 1 ticket by default seems appropriate for some Sōzu users, but we may still want to make this number configurable in the Sōzu configuration file, so that any Sōzu user can trade security for performance to their liking. What do you think @FlorentinDUBOIS and @Wonshtrum ?
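To make the proposal concrete, here is a rough sketch of what a configurable value could look like. `HttpsListenerOptions` and its field are hypothetical and do not exist in the Sōzu configuration today. To our knowledge the Rustls field in question is `send_tls13_tickets` on `ServerConfig`, so the final assignment is shown as a comment, since it needs the rustls crate.

```rust
/// Rustls default discussed above.
const RUSTLS_DEFAULT_TLS13_TICKETS: usize = 4;

/// Hypothetical subset of an HTTPS listener configuration.
/// This option is an illustration, not an existing Sōzu setting.
#[derive(Debug, Default)]
struct HttpsListenerOptions {
    /// Number of TLS 1.3 tickets to send after a handshake.
    /// `None` keeps the Rustls default.
    send_tls13_tickets: Option<usize>,
}

impl HttpsListenerOptions {
    /// Value that would be handed to Rustls for this listener.
    fn effective_tls13_tickets(&self) -> usize {
        self.send_tls13_tickets
            .unwrap_or(RUSTLS_DEFAULT_TLS13_TICKETS)
    }
}

// With Rustls 0.21 the value would then be applied roughly like this
// (assumed field name, not compiled here):
//
//     server_config.send_tls13_tickets = options.effective_tls13_tickets();
```

Leaving the option unset keeps the current behavior, so existing configuration files would not change meaning.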

EDIT: benchmarking this change with tls-perf does NOT seem to improve performance in any relevant way. We may keep the line of code with a set value of 4, and an explanatory comment, for future generations of developers.

Comments welcome!