Closed: Jake-Shadle closed this 3 months ago
Can you run some benchmarks compared to main and include the results?
Build Succeeded :partying_face:
Build Id: e2e97722-c167-429b-82e2-4c22c253940e
The following development images have been built, and will exist for the next 30 days:
To build this version:
git fetch git@github.com:googleforgames/quilkin.git pull/993/head:pr_993 && git checkout pr_993
cargo build
I've reviewed the code and LGTM; once we have some benchmark results so we know whether, and by how much, this improves performance, I'll approve.
The read-write benchmarks are non-functional even on main, so I need to figure out why they're broken and fix them first, I guess.
main:
Aggregated Function Time : count 100000 avg 0.0062609285 +/- 0.01791 min 0.002268194 max 0.753708539 sum 626.092851
# range, mid point, percentile, count
>= 0.00226819 <= 0.003 , 0.0026341 , 0.52, 519
> 0.003 <= 0.004 , 0.0035 , 16.39, 15875
> 0.004 <= 0.005 , 0.0045 , 62.38, 45983
> 0.005 <= 0.006 , 0.0055 , 70.01, 7631
> 0.006 <= 0.007 , 0.0065 , 80.66, 10655
> 0.007 <= 0.008 , 0.0075 , 87.31, 6644
> 0.008 <= 0.009 , 0.0085 , 90.74, 3432
> 0.009 <= 0.01 , 0.0095 , 93.59, 2852
> 0.01 <= 0.011 , 0.0105 , 95.11, 1523
> 0.011 <= 0.012 , 0.0115 , 96.19, 1079
> 0.012 <= 0.014 , 0.013 , 97.25, 1057
> 0.014 <= 0.016 , 0.015 , 97.89, 642
> 0.016 <= 0.018 , 0.017 , 98.31, 414
> 0.018 <= 0.02 , 0.019 , 98.59, 280
> 0.02 <= 0.025 , 0.0225 , 99.00, 410
> 0.025 <= 0.03 , 0.0275 , 99.32, 324
> 0.03 <= 0.035 , 0.0325 , 99.53, 207
> 0.035 <= 0.04 , 0.0375 , 99.69, 165
> 0.04 <= 0.045 , 0.0425 , 99.80, 111
> 0.045 <= 0.05 , 0.0475 , 99.90, 94
> 0.05 <= 0.06 , 0.055 , 99.93, 36
> 0.06 <= 0.07 , 0.065 , 99.94, 8
> 0.07 <= 0.08 , 0.075 , 99.94, 4
> 0.7 <= 0.753709 , 0.726854 , 100.00, 55
# target 50% 0.00473084
# target 75% 0.00646851
# target 90% 0.00878467
# target 99% 0.0250617
# target 99.9% 0.0508333
Error cases : count 55 avg 0.75112423 +/- 0.0009158 min 0.750025763 max 0.753708539 sum 41.3118325
# range, mid point, percentile, count
>= 0.750026 <= 0.753709 , 0.751867 , 100.00, 55
# target 50% 0.751833
# target 75% 0.752771
# target 90% 0.753333
# target 99% 0.753671
# target 99.9% 0.753705
Sockets used: 59 (for perfect no error run, would be 4)
Total Bytes sent: 2400000, received: 2398680
udp OK : 99945 (99.9 %)
udp timeout : 55 (0.1 %)
All done 100000 calls (plus 0 warmup) 6.261 ms avg, 635.6 qps
pr:
Aggregated Function Time : count 100000 avg 0.0062134501 +/- 0.01246 min 0.002205052 max 0.752862069 sum 621.345007
# range, mid point, percentile, count
>= 0.00220505 <= 0.003 , 0.00260253 , 0.34, 342
> 0.003 <= 0.004 , 0.0035 , 15.26, 14916
> 0.004 <= 0.005 , 0.0045 , 61.80, 46545
> 0.005 <= 0.006 , 0.0055 , 70.47, 8672
> 0.006 <= 0.007 , 0.0065 , 78.83, 8352
> 0.007 <= 0.008 , 0.0075 , 84.10, 5272
> 0.008 <= 0.009 , 0.0085 , 87.35, 3250
> 0.009 <= 0.01 , 0.0095 , 91.30, 3955
> 0.01 <= 0.011 , 0.0105 , 93.55, 2244
> 0.011 <= 0.012 , 0.0115 , 94.95, 1403
> 0.012 <= 0.014 , 0.013 , 96.67, 1718
> 0.014 <= 0.016 , 0.015 , 97.62, 955
> 0.016 <= 0.018 , 0.017 , 98.16, 539
> 0.018 <= 0.02 , 0.019 , 98.51, 344
> 0.02 <= 0.025 , 0.0225 , 99.07, 560
> 0.025 <= 0.03 , 0.0275 , 99.40, 333
> 0.03 <= 0.035 , 0.0325 , 99.61, 206
> 0.035 <= 0.04 , 0.0375 , 99.73, 122
> 0.04 <= 0.045 , 0.0425 , 99.86, 133
> 0.045 <= 0.05 , 0.0475 , 99.92, 58
> 0.05 <= 0.06 , 0.055 , 99.96, 41
> 0.06 <= 0.07 , 0.065 , 99.97, 9
> 0.07 <= 0.08 , 0.075 , 99.97, 5
> 0.08 <= 0.09 , 0.085 , 99.97, 1
> 0.7 <= 0.752862 , 0.726431 , 100.00, 25
# target 50% 0.00474642
# target 75% 0.00654179
# target 90% 0.00967029
# target 99% 0.0244018
# target 99.9% 0.0483621
Error cases : count 25 avg 0.75080686 +/- 0.0007781 min 0.750041919 max 0.752862069 sum 18.7701715
# range, mid point, percentile, count
>= 0.750042 <= 0.752862 , 0.751452 , 100.00, 25
# target 50% 0.751393
# target 75% 0.752128
# target 90% 0.752568
# target 99% 0.752833
# target 99.9% 0.752859
Sockets used: 29 (for perfect no error run, would be 4)
Total Bytes sent: 2400000, received: 2399400
udp OK : 99975 (100.0 %)
udp timeout : 25 (0.0 %)
All done 100000 calls (plus 0 warmup) 6.213 ms avg, 638.9 qps
This basically lines up with what I expected: they are very close to each other in the simplest 1<->1 case, but I would expect the difference to grow a bit with more clients and servers. It's at least not worse.
This is a fairly major change to swap out tokio-uring with the lower-level io-uring, which has some upsides and downsides.
Upsides
In tokio-uring, every UDP recv_from and send_to performs 3 heap allocations (maybe even more in other parts of the code?), which is extremely wasteful in the context of a proxy that can be sending and receiving many thousands of packets a second. Moving to io-uring means we need to take responsibility for the lifetimes of memory being written/read by the kernel during I/O, but it also means we can minimize or get rid of memory allocations, since we have the full context. For example, the QCMP loop now doesn't use the heap at all, in favor of just reusing stack allocations.
Additionally, the current code which forwards packets either downstream or upstream only ever sends 1 packet at a time per worker/session; the new code takes advantage of not being async/await by sending up to a few thousand packets concurrently, reducing a (probably minor) throughput bottleneck.
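To make the buffer-ownership point concrete, here is a minimal, hypothetical sketch of the pattern the io-uring crate enables, not the actual code in this PR: a single caller-owned buffer is handed to the kernel and reused for every packet, with no per-packet heap allocation, and entries are pushed to the submission queue before an explicit submit. It uses a connected UDP socket and the plain Recv/Send opcodes for brevity (the real proxy would need recvmsg/sendmsg-style ops to capture peer addresses); socket addresses and buffer sizes are illustrative.

```rust
use io_uring::{opcode, types, IoUring};
use std::{net::UdpSocket, os::unix::io::AsRawFd};

fn main() -> std::io::Result<()> {
    // A connected UDP socket keeps the example short.
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    socket.connect("127.0.0.1:9000")?;
    let fd = types::Fd(socket.as_raw_fd());

    let mut ring = IoUring::new(256)?;

    // One long-lived buffer, reused for every receive: no per-packet heap
    // allocation. We are now responsible for keeping it alive (and unmoved)
    // while the kernel may still be writing into it.
    let mut buf = [0u8; 2048];

    loop {
        let recv = opcode::Recv::new(fd, buf.as_mut_ptr(), buf.len() as u32)
            .build()
            .user_data(1);

        // SAFETY: `buf` outlives the operation and is not touched again until
        // the matching completion is reaped below.
        unsafe { ring.submission().push(&recv).expect("submission queue full") };
        ring.submit_and_wait(1)?;

        let cqe = ring.completion().next().expect("missing completion");
        let len = cqe.result();
        if len < 0 {
            return Err(std::io::Error::from_raw_os_error(-len));
        }

        // Echo the datagram back. Many sends could be queued here before a
        // single submit, which is the "thousands of packets in flight" upside.
        let send = opcode::Send::new(fd, buf.as_ptr(), len as u32)
            .build()
            .user_data(2);
        unsafe { ring.submission().push(&send).expect("submission queue full") };
        ring.submit_and_wait(1)?;
        let _ = ring.completion().next();
    }
}
```

The trade-off is visible in the SAFETY comment: the caller, not the runtime, must guarantee the buffer remains valid until the corresponding completion has been reaped.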
Downsides
A lot more code, some of which is unsafe, though slightly less than it could have been, since the session and packet_router now share the same implementation. The non-Linux code is also now separated out: the io-uring loop is not async, so we can no longer pretend the code is the same between Linux and non-Linux, which also contributes to the increase in code.
Overall, it's simply more complicated than the old code, but it does give us tighter control.
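For illustration, the platform split described above roughly amounts to cfg-gating the two implementations and exposing a common entry point; the module and function names below are hypothetical, not the actual layout in this PR.

```rust
// Hypothetical layout: the io-uring loop only exists on Linux, other targets
// keep an async implementation, and callers get one of them at compile time.
#[cfg(target_os = "linux")]
mod io_uring_loop {
    /// Runs the blocking io-uring submit/complete loop on a dedicated thread.
    pub fn spawn_packet_loop() { /* ... */ }
}

#[cfg(not(target_os = "linux"))]
mod async_loop {
    /// Non-Linux fallback using ordinary async sockets.
    pub async fn spawn_packet_loop() { /* ... */ }
}

#[cfg(target_os = "linux")]
pub use io_uring_loop::spawn_packet_loop;
#[cfg(not(target_os = "linux"))]
pub use async_loop::spawn_packet_loop;
```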