vtnerd opened 2 months ago
One point to clarify - the account used had 0 received and 0 spent outputs. There are allocations in the code paths when the account has received or spent funds, which likely would have slowed the responses a little. Testing with an empty account was intentional, to measure the maximum possible throughput and to compare the relative slowdown of the ZMQ calls. Subsequent load tests with accounts that have received and spent funds will likely be done to see whether it's worth attempting a "streaming" design from LMDB directly into JSON.
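To illustrate what a "streaming" design could mean, a loose sketch follows. This is not monero-lws code: the database handles, record layout, and use of rapidjson are assumptions, and real wallet records would likely need hex or base64 encoding rather than being emitted as raw strings. The idea is simply to walk an LMDB cursor and emit each record directly into a JSON writer instead of building an intermediate container first.

```cpp
// Illustrative only: stream records from an LMDB cursor straight into a
// rapidjson writer, instead of copying them into a std::vector first.
// The txn/dbi setup and the assumption that values are UTF-8 text are
// placeholders, not the monero-lws data model.
#include <lmdb.h>
#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>
#include <string>

// Serialize every value in `dbi` as a JSON array of strings.
// Returns the JSON text, or an empty string on error.
std::string dump_db_as_json(MDB_txn* txn, MDB_dbi dbi)
{
  MDB_cursor* cursor = nullptr;
  if (mdb_cursor_open(txn, dbi, &cursor) != MDB_SUCCESS)
    return {};

  rapidjson::StringBuffer out;
  rapidjson::Writer<rapidjson::StringBuffer> json(out);
  json.StartArray();

  MDB_val key{};
  MDB_val value{};
  int rc = mdb_cursor_get(cursor, &key, &value, MDB_FIRST);
  while (rc == MDB_SUCCESS)
  {
    // Each record is written directly from the memory-mapped page; no
    // per-record allocation beyond what the writer's output buffer needs.
    json.String(static_cast<const char*>(value.mv_data),
                static_cast<rapidjson::SizeType>(value.mv_size));
    rc = mdb_cursor_get(cursor, &key, &value, MDB_NEXT);
  }

  json.EndArray();
  mdb_cursor_close(cursor);
  return out.GetString();
}
```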
Another clarification: this was using one REST thread. Despite the 32 threads on the server, I don't see a reason to increase the REST thread count, because I wanted to test the throughput of a single response thread.
I've gotten a quick proof-of-concept for `boost::beast`, and the `wrk2` load stresser shows that it handles roughly an additional ~600 requests/sec (~18,000 requests/sec total). The latency average and latency stddev are also somewhat lower.

Given that there was no drop in performance when switching to `boost::beast`, I will move forward with the attempt to switch to AZMQ so that individual REST handlers can be suspended/resumed. This is likely a bigger overhaul, so expect a delay in updates to this change.
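For reference, the skeleton of a synchronous `boost::beast` server is roughly the following. This is only an illustrative sketch, not the actual proof-of-concept: the port, the one-request-per-connection loop, and the placeholder JSON body are all assumptions.

```cpp
// Illustrative sketch only (not the actual proof-of-concept): a single-threaded,
// synchronous boost::beast HTTP server. The port, one-request-per-connection
// loop, and placeholder JSON body are assumptions.
#include <boost/asio.hpp>
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>

namespace beast = boost::beast;
namespace http = beast::http;
namespace net = boost::asio;
using tcp = net::ip::tcp;

int main()
{
  net::io_context ioc;
  tcp::acceptor acceptor{ioc, tcp::endpoint{tcp::v4(), 8080}};

  for (;;)
  {
    tcp::socket socket{ioc};
    acceptor.accept(socket);

    beast::error_code ec;
    beast::flat_buffer buffer;
    http::request<http::string_body> req;
    http::read(socket, buffer, req, ec);   // parse one HTTP request
    if (ec)
      continue;

    // Placeholder handler - a real server would dispatch on req.target()
    // and build the JSON response for that endpoint.
    http::response<http::string_body> res{http::status::ok, req.version()};
    res.set(http::field::content_type, "application/json");
    res.keep_alive(false);
    res.body() = "{}";
    res.prepare_payload();
    http::write(socket, res, ec);          // send the response

    socket.shutdown(tcp::socket::shutdown_send, ec);
  }
}
```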
`get_address_info` with the `boost::beast` proof-of-concept:

```
Running 10s test @ [internal_ip]:8080/get_address_info
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   460.01ms  269.25ms 952.83ms   57.54%
    Req/Sec      -nan      -nan     0.00     0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
  [Mean = 460.012, StdDeviation = 269.253]
  [Max = 952.832, Total count = 180905]
  [Buckets = 27, SubBuckets = 2048]
  180913 requests in 10.00s, 50.21MB read
Requests/sec:  18093.27
Transfer/sec:      5.02MB
```
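For context, a run like the one above is typically produced with an invocation along the following lines. The rate target, script name, and request body are assumptions - the issue shows only the output, not the exact command:

```sh
# Hypothetical wrk2 invocation matching the reported parameters
# (8 threads, 100 connections, 10s duration, ~20,000 req/sec target).
# get_address_info.lua would set wrk.method/wrk.body for the POST request;
# its contents are not part of this issue.
wrk -t8 -c100 -d10s -R20000 --latency \
    -s get_address_info.lua \
    http://[internal_ip]:8080/get_address_info
```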
I should also mention that some additional constraints should be imposed on `boost::beast` somehow, but I have to dig into the library further to figure out how.
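One plausible example of such a constraint is capping request sizes during parsing - a minimal sketch, assuming the handler reads into an `http::request_parser` (the limits shown are placeholders, not values from this issue):

```cpp
// Illustrative only: cap header and body sizes when parsing an incoming
// request with boost::beast. The limits below are placeholder values.
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>

namespace beast = boost::beast;
namespace http = beast::http;

template <typename Stream>
http::request<http::string_body> read_limited(Stream& stream, beast::flat_buffer& buffer)
{
  http::request_parser<http::string_body> parser;
  parser.header_limit(8 * 1024);        // reject oversized headers
  parser.body_limit(64 * 1024);         // reject oversized bodies
  http::read(stream, buffer, parser);   // throws on error or limit breach
  return parser.release();
}
```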
These are some basic numbers for the REST server, to determine what needs to change to remove bottlenecks. I used the `wrk2` utility, which seems to place the server under decent load. `monero-lws-daemon` and `monerod` were both running on a Ryzen 3900x box with 32 GiB RAM, whereas `wrk2` was run on a laptop. A wired connection (to the same switch) was used to ensure that latencies were low and consistent.

### Raw Performance Numbers
#### `login`

```
Running 10s test @ [internal_ip]:8080/login
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   476.29ms  840.19ms   4.60s    87.19%
    Req/Sec      -nan      -nan     0.00     0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%   28.38ms
   75.000%  594.43ms
   90.000%    1.65s
   99.000%    3.51s
   99.900%    4.03s
   99.990%    4.53s
   99.999%    4.60s
  100.000%    4.60s
  [Mean = 476.288, StdDeviation = 840.193]
  [Max = 4599.808, Total count = 174799]
  [Buckets = 27, SubBuckets = 2048]
  174807 requests in 10.00s, 34.34MB read
Requests/sec:  17483.81
Transfer/sec:      3.43MB
```
#### `get_address_info`

```
Running 10s test @ [internal_ip]:8080/get_address_info
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   526.71ms  816.44ms   4.26s    87.16%
    Req/Sec      -nan      -nan     0.00     0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%  152.06ms
   75.000%  519.17ms
   90.000%    1.65s
   99.000%    3.72s
   99.900%    4.11s
   99.990%    4.22s
   99.999%    4.26s
  100.000%    4.27s
  [Mean = 526.713, StdDeviation = 816.440]
  [Max = 4263.936, Total count = 174725]
  [Buckets = 27, SubBuckets = 2048]
  174733 requests in 10.00s, 58.99MB read
Requests/sec:  17473.53
Transfer/sec:      5.90MB
```
#### `get_unspent_outs`

```
Running 10s test @ [internal_ip]:8080/get_unspent_outs
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.08s     1.86s    7.82s    60.48%
    Req/Sec      -nan      -nan     0.00     0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%    3.03s
   75.000%    4.58s
   90.000%    5.53s
   99.000%    7.15s
   99.900%    7.47s
   99.990%    7.75s
   99.999%    7.81s
  100.000%    7.82s
  [Mean = 3079.689, StdDeviation = 1861.498]
  [Max = 7819.264, Total count = 71066]
  [Buckets = 27, SubBuckets = 2048]
  71074 requests in 10.00s, 14.30MB read
Requests/sec:   7106.51
Transfer/sec:      1.43MB
```
#### `get_random_outs`

```
Running 10s test @ [internal_ip]:8080/get_random_outs
  8 threads and 100 connections
  Thread calibration: mean lat.: 4970.170ms, rate sampling interval: 19087ms
  Thread calibration: mean lat.: 5377.621ms, rate sampling interval: 16736ms
  Thread calibration: mean lat.: 4950.556ms, rate sampling interval: 15835ms
  Thread calibration: mean lat.: 5076.512ms, rate sampling interval: 17317ms
  Thread calibration: mean lat.: 4642.232ms, rate sampling interval: 16113ms
  Thread calibration: mean lat.: 6349.897ms, rate sampling interval: 19365ms
  Thread calibration: mean lat.: 4918.584ms, rate sampling interval: 15532ms
  Thread calibration: mean lat.: 4544.176ms, rate sampling interval: 15261ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    -nanus    -nanus   0.00us    0.00%
    Req/Sec      -nan      -nan     0.00     0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%    0.00us
   75.000%    0.00us
   90.000%    0.00us
   99.000%    0.00us
   99.900%    0.00us
   99.990%    0.00us
   99.999%    0.00us
  100.000%    0.00us
  [Mean = -nan, StdDeviation = -nan]
  [Max = 0.000, Total count = 0]
  [Buckets = 27, SubBuckets = 2048]
  68 requests in 10.07s, 219.44KB read
  Socket errors: connect 0, read 0, write 0, timeout 421
Requests/sec:      6.75
Transfer/sec:     21.80KB
```
### Analysis

`login` and `get_address_info` both max out at around ~17,400 requests a second. I requested 20,000 from `wrk2` - it's not clear why this target couldn't be achieved (link limit or laptop limit).

`get_unspent_outs` maxed out at around ~7,100 requests a second. This is expected, and almost certainly due to the ZMQ call within the handler code. The values returned over ZMQ could be cached safely, but when the cache timeout hits, the throughput will drop by 50%.

`get_random_outs` had really low throughput, which is to be expected; this handler also makes an expensive ZMQ call. Caching is a little trickier here, because the random output selection would be delayed from real-time. Even with caching, when the cache timeout hits, the throughput of the REST threads will drop dramatically.

### Steps from Here
In both cases, the requests/sec drop comes from blocking ZMQ calls within the HTTP handler. The "correct" engineering fix is to pause/resume the REST handlers so that the ZMQ calls never block any of the handler threads. This cannot be achieved with the epee HTTP server, because responses must be synchronous in that framework.
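As a rough sketch of the suspend/resume idea, assuming AZMQ's asio-style `async_send`/`async_receive` (the socket type, endpoint, and message contents below are placeholders - this is not how monero-lws is actually structured):

```cpp
// Rough sketch of the suspend/resume idea with AZMQ: the handler issues an
// asynchronous ZMQ request and returns immediately; the completion handler
// finishes the work later on the same io_context, so other REST handlers
// keep running while ZMQ waits. Socket type, endpoint, and message are
// placeholders.
#include <boost/asio.hpp>
#include <azmq/socket.hpp>
#include <zmq.h>
#include <array>
#include <cstddef>
#include <iostream>
#include <memory>
#include <string>

int main()
{
  boost::asio::io_context io;
  azmq::socket daemon(io, ZMQ_REQ);             // placeholder socket type
  daemon.connect("tcp://127.0.0.1:18082");      // placeholder endpoint

  const std::string request = R"({"method":"get_unspent_outs"})";
  auto reply = std::make_shared<std::array<char, 65536>>();

  // Neither call blocks the calling thread; completion handlers run on io.
  daemon.async_send(boost::asio::buffer(request),
    [&daemon, reply](boost::system::error_code ec, std::size_t) {
      if (ec)
        return;
      daemon.async_receive(boost::asio::buffer(*reply),
        [reply](boost::system::error_code ec, std::size_t bytes) {
          if (!ec)
            std::cout.write(reply->data(), bytes); // "resume": build the HTTP reply here
        });
    });

  io.run();
}
```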
The steps (in-order) to achieve better throughput with the REST server:
1. Write a quick proof-of-concept REST server using `boost::beast`.
2. Load test the `boost::beast` proof-of-concept - make sure the requests/sec are similar to the current HTTP server.
3. If `boost::beast` passes tests (on `login`, and `get_address_info`), then get the code in a "shippable" state.
4. Incorporate AZMQ into the `boost::beast` framework, such that `get_unspent_outs` and `random_outs` never block on ZMQ calls.
5. Cache the ZMQ responses used by `get_unspent_outs` so that throughput on that endpoint improves (a rough cache sketch follows this list).
6. Do not cache the ZMQ responses for `get_random_outs`, as it's too risky to give stale data on that call.
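As a rough illustration of the caching idea in step 5 (not monero-lws code - the class name, locking choice, and 60s timeout are all placeholders), a time-bounded cache of the serialized ZMQ response might look like:

```cpp
// Illustrative TTL cache for a serialized ZMQ reply (e.g. the data backing
// get_unspent_outs). All names and the 60s default timeout are placeholders;
// this is not the monero-lws implementation.
#include <chrono>
#include <mutex>
#include <optional>
#include <string>

class zmq_response_cache
{
  std::mutex mutex_;
  std::string cached_;                                 // last serialized reply
  std::chrono::steady_clock::time_point expires_{};    // when it goes stale

public:
  // Returns the cached reply if it is still fresh, otherwise nothing.
  std::optional<std::string> get()
  {
    const std::lock_guard<std::mutex> lock{mutex_};
    if (!cached_.empty() && std::chrono::steady_clock::now() < expires_)
      return cached_;
    return std::nullopt;
  }

  // Stores a freshly fetched reply; later calls to get() reuse it until the
  // TTL expires, at which point one request pays the full ZMQ cost again -
  // which is exactly the periodic throughput drop described above.
  void store(std::string reply, std::chrono::seconds ttl = std::chrono::seconds{60})
  {
    const std::lock_guard<std::mutex> lock{mutex_};
    cached_ = std::move(reply);
    expires_ = std::chrono::steady_clock::now() + ttl;
  }
};
```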