vtnerd / monero-lws

Monero Light Wallet Server (scans monero viewkeys and implements mymonero API). Fast LMDB backend.

Load Testing REST Server #134

Open · vtnerd opened this issue 2 days ago

vtnerd commented 2 days ago

These are some basic numbers for the REST server, gathered to determine which bottlenecks need to be addressed. I used the wrk2 utility, which seems to place the server under decent load. monero-lws-daemon and monerod were both running on a Ryzen 3900X box with 32 GiB RAM, while wrk2 was run from a laptop. A wired connection (to the same switch) was used to ensure that latencies were low and consistent.

Raw Performance Numbers

login

Running 10s test @ [internal_ip]:8080/login
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   476.29ms  840.19ms   4.60s    87.19%
    Req/Sec     -nan      -nan     0.00      0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%   28.38ms
   75.000%  594.43ms
   90.000%    1.65s
   99.000%    3.51s
   99.900%    4.03s
   99.990%    4.53s
   99.999%    4.60s
  100.000%    4.60s

[Mean = 476.288, StdDeviation = 840.193] [Max = 4599.808, Total count = 174799] [Buckets = 27, SubBuckets = 2048]

174807 requests in 10.00s, 34.34MB read
Requests/sec: 17483.81
Transfer/sec: 3.43MB

get_address_info

Running 10s test @ [internal_ip]:8080/get_address_info
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   526.71ms  816.44ms   4.26s    87.16%
    Req/Sec     -nan      -nan     0.00      0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%  152.06ms
   75.000%  519.17ms
   90.000%    1.65s
   99.000%    3.72s
   99.900%    4.11s
   99.990%    4.22s
   99.999%    4.26s
  100.000%    4.27s

[Mean = 526.713, StdDeviation = 816.440] [Max = 4263.936, Total count = 174725] [Buckets = 27, SubBuckets = 2048]

174733 requests in 10.00s, 58.99MB read
Requests/sec: 17473.53
Transfer/sec: 5.90MB

get_unspent_outs

Running 10s test @ [internal_ip]:8080/get_unspent_outs
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.08s     1.86s     7.82s    60.48%
    Req/Sec     -nan      -nan     0.00      0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%    3.03s
   75.000%    4.58s
   90.000%    5.53s
   99.000%    7.15s
   99.900%    7.47s
   99.990%    7.75s
   99.999%    7.81s
  100.000%    7.82s

[Mean = 3079.689, StdDeviation = 1861.498] [Max = 7819.264, Total count = 71066] [Buckets = 27, SubBuckets = 2048]

71074 requests in 10.00s, 14.30MB read
Requests/sec: 7106.51
Transfer/sec: 1.43MB

get_random_outs

Running 10s test @ [internal_ip]:8080/get_random_outs
  8 threads and 100 connections
  Thread calibration: mean lat.: 4970.170ms, rate sampling interval: 19087ms
  Thread calibration: mean lat.: 5377.621ms, rate sampling interval: 16736ms
  Thread calibration: mean lat.: 4950.556ms, rate sampling interval: 15835ms
  Thread calibration: mean lat.: 5076.512ms, rate sampling interval: 17317ms
  Thread calibration: mean lat.: 4642.232ms, rate sampling interval: 16113ms
  Thread calibration: mean lat.: 6349.897ms, rate sampling interval: 19365ms
  Thread calibration: mean lat.: 4918.584ms, rate sampling interval: 15532ms
  Thread calibration: mean lat.: 4544.176ms, rate sampling interval: 15261ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    -nanus    -nanus   0.00us    0.00%
    Req/Sec     -nan      -nan     0.00      0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%   0.00us
   75.000%   0.00us
   90.000%   0.00us
   99.000%   0.00us
   99.900%   0.00us
   99.990%   0.00us
   99.999%   0.00us
  100.000%   0.00us

[Mean = -nan, StdDeviation = -nan] [Max = 0.000, Total count = 0] [Buckets = 27, SubBuckets = 2048]

68 requests in 10.07s, 219.44KB read
Socket errors: connect 0, read 0, write 0, timeout 421
Requests/sec: 6.75
Transfer/sec: 21.80KB

Analysis

login and get_address_info both max out at around ~17,400 requests per second. I asked wrk2 for 20,000 - it's not clear why that target couldn't be reached (a link limit or a laptop limit).

get_unspent_outs maxed out at around ~7,100 requests per second. This is expected, and is almost certainly due to the ZMQ call within the handler code. The values returned over ZMQ could be cached safely, but when the cache timeout hits, throughput will drop by 50%.

get_random_outs had really low throughput, which is to be expected: it also makes an expensive ZMQ call in the handler. Caching this one is a little trickier because the random output selection would then lag behind real-time. And even with caching, the throughput of the REST threads will drop dramatically whenever the cache timeout hits.
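For illustration, a minimal sketch of the kind of time-bounded cache that could sit in front of such a blocking call; the class name and TTL handling are hypothetical, not the actual monero-lws code:

```cpp
#include <chrono>
#include <mutex>
#include <optional>
#include <string>

// Hypothetical cache guarding an expensive blocking fetch (e.g. a ZMQ
// round-trip to monerod). Requests arriving after the TTL expires still
// wait on the refresh, which is why throughput dips when the timeout hits.
class ttl_cache
{
  std::mutex sync_;
  std::optional<std::string> value_;
  std::chrono::steady_clock::time_point expires_{};
  std::chrono::seconds ttl_;

public:
  explicit ttl_cache(std::chrono::seconds ttl) : ttl_(ttl) {}

  template<typename F>
  std::string get(F&& fetch) // `fetch` performs the blocking ZMQ call
  {
    const auto now = std::chrono::steady_clock::now();
    std::lock_guard<std::mutex> lock{sync_};
    if (!value_ || now >= expires_)
    {
      value_ = fetch();        // blocks the handler thread during refresh
      expires_ = now + ttl_;
    }
    return *value_;
  }
};
```

The single mutex also serializes refreshes, so only one ZMQ round-trip is issued per expiry, but every request arriving during that refresh still waits on it - which is the throughput dip described above.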

Steps from Here

In both cases, the requests/sec drop came from blocking ZMQ calls within the HTTP handler. The "correct" engineering fix is to pause/resume the REST handlers so that the ZMQ calls never block any of the handler threads. This cannot be achieved with the epee HTTP server, because that framework requires the response to be produced synchronously.

The steps (in order) to achieve better throughput with the REST server:

1. Replace the epee HTTP server with boost::beast.
2. Switch to AZMQ so that individual REST handlers can be suspended while waiting on monerod and resumed when the reply arrives (see the sketch below).
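To make the pause/resume idea concrete, here is a rough sketch of the shape this could take with boost::asio: the handler starts an asynchronous daemon call and only writes the HTTP response from the completion handler, so the I/O thread is never blocked. A steady_timer stands in for the AZMQ round-trip; the function names are hypothetical and this is not the monero-lws code.

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <cstdio>
#include <functional>
#include <memory>
#include <string>

namespace asio = boost::asio;

// Stand-in for an asynchronous ZMQ request (e.g. via AZMQ). The timer only
// simulates the latency of the monerod round-trip; a real implementation
// would use an async receive on the ZMQ socket instead.
void async_daemon_call(asio::io_context& io,
                       std::function<void(std::string)> on_reply)
{
  auto timer = std::make_shared<asio::steady_timer>(io, std::chrono::milliseconds(50));
  timer->async_wait([timer, on_reply = std::move(on_reply)](boost::system::error_code)
  {
    on_reply("{\"per_byte_fee\": 123}"); // pretend ZMQ reply
  });
}

// Hypothetical handler: instead of blocking on the ZMQ reply, it captures a
// "send response" continuation and resumes once the reply arrives.
void handle_get_unspent_outs(asio::io_context& io,
                             std::function<void(std::string)> send_response)
{
  async_daemon_call(io, [send_response = std::move(send_response)](std::string reply)
  {
    send_response(reply); // resume: write the HTTP response now
  });
}

int main()
{
  asio::io_context io;
  handle_get_unspent_outs(io, [](std::string body)
  {
    // In a real server this would be an async write on the client socket.
    std::printf("response: %s\n", body.c_str());
  });
  io.run(); // the single thread stays free to run other handlers while waiting
}
```

The key property is that nothing in the handler blocks: the thread returns to the io_context immediately after initiating the daemon call, so one response thread can keep many in-flight requests suspended at once.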

vtnerd commented 2 days ago

One point to clarify: the account used had 0 received and 0 spent outputs. There are allocations in the code paths when an account has received or spent funds, which likely would have slowed the responses a little. Testing with the empty account was intentional, to measure the maximum possible throughput and to compare the slowdown from the ZMQ calls. Subsequent load tests with accounts that have received and spent funds will likely be done to see whether it's worth attempting a "streaming" design from LMDB directly into JSON.
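As a toy illustration of what "streaming" could mean here (the row type and cursor interface are hypothetical, not the actual monero-lws/LMDB code): each row is written straight into the response stream instead of being collected into an intermediate container first.

```cpp
#include <cstdint>
#include <ostream>

// Hypothetical row as it might come out of an LMDB cursor.
struct output_row
{
  std::uint64_t amount;
  std::uint64_t global_index;
};

// Streaming approach: serialize each row directly into the response stream,
// avoiding a temporary std::vector<output_row> (or per-row strings).
template<typename Cursor>
void write_outputs_json(std::ostream& out, Cursor& cursor)
{
  out << "{\"outputs\":[";
  bool first = true;
  for (const output_row* row = cursor.next(); row != nullptr; row = cursor.next())
  {
    if (!first)
      out << ',';
    first = false;
    out << "{\"amount\":" << row->amount
        << ",\"global_index\":" << row->global_index << '}';
  }
  out << "]}";
}
```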

vtnerd commented 2 days ago

Another clarification: this was all done with one REST thread. Despite the 32 threads on the server, I don't see a reason to increase the REST thread count, because I wanted to test the throughput of a single response thread.

vtnerd commented 1 day ago

I've gotten a quick proof-of-concept for boost::beast working, and the wrk2 load stresser shows that it handles roughly an additional ~600 requests/sec (or ~18,000 requests/sec total). The average latency and latency stddev are also somewhat lower.

Given that there was no drop in performance when switching to boost::beast, I will move forward with the attempt to switch to AZMQ so that individual REST handlers can be suspended/resumed. This is likely a bigger overhaul, so expect a delay before the next update on this change.

Raw Numbers

Running 10s test @ [internal_ip]:8080/get_address_info
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   460.01ms  269.25ms  952.83ms   57.54%
    Req/Sec     -nan      -nan     0.00      0.00%
  Latency Distribution (HdrHistogram - Recorded Latency)

[Mean = 460.012, StdDeviation = 269.253] [Max = 952.832, Total count = 180905] [Buckets = 27, SubBuckets = 2048]

180913 requests in 10.00s, 50.21MB read
Requests/sec: 18093.27
Transfer/sec: 5.02MB

vtnerd commented 1 day ago

I should also mention that some additional constraints should be placed on boost::beast somehow, but I have to dig into the library further to figure out how.
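If the constraints in question are request size limits, boost::beast's request parser does expose knobs for them; a hedged sketch (the limit values are arbitrary placeholders, not anything monero-lws uses):

```cpp
#include <boost/asio/ip/tcp.hpp>
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>

namespace beast = boost::beast;
namespace http = beast::http;

// Hypothetical per-connection read with explicit size limits.
http::request<http::string_body>
read_limited_request(boost::asio::ip::tcp::socket& socket, beast::flat_buffer& buffer)
{
  http::request_parser<http::string_body> parser;
  parser.header_limit(8 * 1024);  // cap the header section
  parser.body_limit(64 * 1024);   // cap the request body

  // Throws on error, e.g. http::error::body_limit when a cap is exceeded.
  http::read(socket, buffer, parser);
  return parser.release();
}
```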