paradigmxyz / reth

Modular, contributor-friendly and blazing-fast implementation of the Ethereum protocol, in Rust
https://reth.rs/

429 response codes at lower than expected rps rates #4064

Closed: erikreppel closed this issue 1 year ago

erikreppel commented 1 year ago

Describe the bug

I've noticed while doing some load testing against our reth node that we get 429 responses while resource utilization is still quite low, across a variety of different configurations. [resource-utilization screenshots]

--rpc-max-connections 50
    ┌─────────────────┐
    │ success vs load │
    └─────────────────┘
      rate (rps)  │      RETH  
    ──────────────┼────────────
              10  │  1.000000  
             100  │  0.465000  
             250  │  0.157467  
             500  │  0.108333  
             750  │  0.099422  
           1,000  │  0.076367  

    ┌────────────────────┐
    │ throughput vs load │
    └────────────────────┘
      rate (rps)  │ RETH (rps)
    ──────────────┼────────────
              10  │      10.0  
             100  │      44.2  
             250  │      37.5  
             500  │      52.1  
             750  │      72.7  
           1,000  │      74.6  

    ┌─────────────┐
    │ p90 vs load │
    └─────────────┘
      rate (rps)  │  RETH (s)  
    ──────────────┼────────────
              10  │  0.383137  
             100  │  1.190531  
             250  │  0.852115  
             500  │  0.425434  
             750  │  0.275404  
           1,000  │  0.142144
--rpc-max-connections 100
    ┌─────────────────┐
    │ success vs load │
    └─────────────────┘
      rate (rps)  │      RETH  
    ──────────────┼────────────
              10  │  1.000000  
             100  │  0.473000  
             250  │  0.163333  
             500  │  0.100813  
             750  │  0.069022  
           1,000  │  0.052233  

    ┌────────────────────┐
    │ throughput vs load │
    └────────────────────┘
      rate (rps)  │ RETH (rps)
    ──────────────┼────────────
              10  │       9.9  
             100  │      43.9  
             250  │      37.5  
             500  │      44.9  
             750  │      48.4  
           1,000  │      47.7  

    ┌─────────────┐
    │ p90 vs load │
    └─────────────┘
      rate (rps)  │  RETH (s)  
    ──────────────┼────────────
              10  │  5.643809  
             100  │  2.775378  
             250  │  1.683839  
             500  │  0.539247  
             750  │  0.300272  
           1,000  │  0.283837
--rpc-max-connections 1000
    ┌─────────────────┐
    │ success vs load │
    └─────────────────┘
      rate (rps)  │         RETH  
    ──────────────┼───────────────
              10  │     1.000000  
             100  │     0.872333  
             250  │     0.080667  
             500  │     0.000000  
             750  │     0.000000  
           1,000  │  3.333333e-5  

    ┌────────────────────┐
    │ throughput vs load │
    └────────────────────┘
      rate (rps)  │  RETH (rps)
    ──────────────┼─────────────
              10  │   9.952439  
             100  │  46.269959  
             250  │  10.083997  
             500  │   0.000000  
             750  │   0.000000  
           1,000  │   0.021663  

    ┌─────────────┐
    │ p90 vs load │
    └─────────────┘
      rate (rps)  │   RETH (s)  
    ──────────────┼─────────────
              10  │   0.805212  
             100  │  30.000239  
             250  │  30.001930  
             500  │  30.001366  
             750  │  30.001275  
           1,000  │  30.000868
--rpc-max-connections 5000
    ┌─────────────────┐
    │ success vs load │
    └─────────────────┘
      rate (rps)  │      RETH  
    ──────────────┼────────────
              10  │  1.000000  
             100  │  0.792667  
             250  │  0.097333  
             500  │  0.016267  
             750  │  0.003689  
           1,000  │  0.000600  

    ┌────────────────────┐
    │ throughput vs load │
    └────────────────────┘
      rate (rps)  │  RETH (rps)
    ──────────────┼─────────────
              10  │   9.922726  
             100  │  40.999452  
             250  │  12.169723  
             500  │   4.171188  
             750  │   1.394754  
           1,000  │   0.300871  

    ┌─────────────┐
    │ p90 vs load │
    └─────────────┘
      rate (rps)  │   RETH (s)  
    ──────────────┼─────────────
              10  │   0.525485  
             100  │  30.000688  
             250  │  30.001450  
             500  │  30.000628  
             750  │  30.000345  
           1,000  │  30.000216

full test output:

{
    "version": "0.2.5",
    "type": "single_test",
    "nodes": {
        "RETH": {
            "name": "RETH",
            "url": "http://reth:8545",
            "remote": null,
            "client_version": "reth/v0.1.0-alpha.3-31af4d5/x86_64-unknown-linux-gnu",
            "network": "ethereum"
        }
    },
    "results": {
        "RETH": {
            "target_rate": [
                10,
                100,
                250,
                500,
                750,
                1000
            ],
            "actual_rate": [
                10.033315135287607,
                100.03532163844285,
                250.0373239798194,
                499.95745432057777,
                750.0343432225372,
                1000.0367458168721
            ],
            "target_duration": [
                30,
                30,
                30,
                30,
                30,
                30
            ],
            "actual_duration": [
                29.900386458,
                29.98940725,
                29.995521791,
                30.002552958,
                29.998626334,
                29.998897666
            ],
            "requests": [
                300,
                3000,
                7500,
                15000,
                22500,
                30000
            ],
            "throughput": [
                9.967055945466727,
                80.2007838214031,
                81.4409657715031,
                79.71695129371324,
                78.16116869006552,
                67.25031723994633
            ],
            "success": [
                1.0,
                0.822,
                0.3504,
                0.16546666666666668,
                0.10848888888888888,
                0.0723
            ],
            "min": [
                0.041514292,
                0.01152825,
                0.01161325,
                0.011028208,
                0.011048792,
                0.010938667
            ],
            "mean": [
                0.188656776,
                0.765482205,
                0.396466551,
                0.175997492,
                0.145290137,
                0.159776724
            ],
            "p50": [
                0.164362051,
                0.809208365,
                0.027378363,
                0.01950688,
                0.019676348,
                0.025054497
            ],
            "p90": [
                0.280936875,
                1.328579104,
                1.196065002,
                0.840960519,
                0.611488142,
                0.444316802
            ],
            "p95": [
                0.409442041,
                1.498301806,
                1.437012983,
                0.994860251,
                1.096046871,
                0.915997179
            ],
            "p99": [
                0.686207562,
                1.999809025,
                2.001787914,
                1.527386355,
                1.637144773,
                2.284471157
            ],
            "max": [
                0.838930792,
                2.899618042,
                5.233336416,
                2.715129792,
                3.249963,
                6.167814917
            ],
            "status_codes": [
                {
                    "200": 300
                },
                {
                    "200": 2466,
                    "429": 534
                },
                {
                    "200": 2628,
                    "429": 4872
                },
                {
                    "200": 2482,
                    "429": 12518
                },
                {
                    "200": 2441,
                    "429": 20059
                },
                {
                    "200": 2169,
                    "429": 27831
                }
            ],
            "errors": [
                [],
                [
                    "429 Too Many Requests"
                ],
                [
                    "429 Too Many Requests"
                ],
                [
                    "429 Too Many Requests"
                ],
                [
                    "429 Too Many Requests"
                ],
                [
                    "429 Too Many Requests"
                ]
            ],
            "first_request_timestamp": [
                "2023-07-17T18:41:03.176619792-04:00",
                "2023-07-17T18:41:34.722660083-04:00",
                "2023-07-17T18:42:15.835315709-04:00",
                "2023-07-17T18:42:58.820529459-04:00",
                "2023-07-17T18:43:41.039177666-04:00",
                "2023-07-17T18:44:25.508792417-04:00"
            ],
            "last_request_timestamp": [
                "2023-07-17T18:41:33.07700625-04:00",
                "2023-07-17T18:42:04.712067333-04:00",
                "2023-07-17T18:42:45.8308375-04:00",
                "2023-07-17T18:43:28.823082417-04:00",
                "2023-07-17T18:44:11.037804-04:00",
                "2023-07-17T18:44:55.507690083-04:00"
            ],
            "last_response_timestamp": [
                "2023-07-17T18:41:33.275778625-04:00",
                "2023-07-17T18:42:05.47048925-04:00",
                "2023-07-17T18:42:48.10408825-04:00",
                "2023-07-17T18:43:29.955689042-04:00",
                "2023-07-17T18:44:12.269519291-04:00",
                "2023-07-17T18:44:57.761428375-04:00"
            ],
            "final_wait_time": [
                0.198772375,
                0.758421917,
                2.27325075,
                1.132606625,
                1.231715291,
                2.253738292
            ],
            "raw_output": [
                null,
                null,
                null,
                null,
                null,
                null
            ]
        }
    }
}

Steps to reproduce

Run reth with --rpc-max-connections set as in the results above, then run:

flood eth_getLogs RETH=http://reth:8545 --rates 10 100 250 500 750 1000 --duration 30

Also observable with flood eth_getBlockByNumber RETH=http://reth:8545 --rates 10 100 250 500 750 1000 --duration 30

Node logs

No response

Platform(s)

Linux (x86)

What version/commit are you on?

reth/v0.1.0-alpha.3-31af4d5/x86_64-unknown-linux-gnu

What database version are you on?

reth db version
Current database version: 1
Local database version: 1

If you've built Reth from source, provide the full command you used

No response


mattsse commented 1 year ago

Just to summarize, the issue is that you get 429s at rates lower than the configured --rpc-max-connections?

Can you try this with a single --rates 1000?

One issue here is that the server is not necessarily back at 0 open connections when the flood run for the next rate starts, because vegeta just drops the connection once a run finishes, but the response for that dropped request can still be in flight at that moment.

erikreppel commented 1 year ago

Correct, that's the issue.

Running it as a single run seems to have helped, but it looks like even at --rates 100 we're hitting "too many open files" issues.

erikreppel commented 1 year ago

Increasing the soft limit on open files seems to have helped, but now I'm seeing the 429 issue again:

flood eth_getLogs RETH=http://127.0.0.1:8545 --rates 1000 --duration 30
...
 "status_codes": [
        {
          "0": 4955,
          "200": 29,
          "429": 25016
        }
      ],
      "errors": [
        [
          "429 Too Many Requests",
          "Post \"http://127.0.0.1:8545\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
        ]
      ],
      ...
mattsse commented 1 year ago

While we do raise the fd limit, this will pick the hard resource limit (ulimit -Hn) if it's lower than the actual kernel limit.

But you can increase that manually; I believe it's in /etc/security/limits.conf or via sysctl.
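
For reference, a minimal Linux-only sketch (illustrative; the function name and error handling are assumptions, not reth's actual startup code) of what raising the soft fd limit up to the hard limit looks like with the libc crate:

use std::io;

/// Raise the soft RLIMIT_NOFILE up to the hard limit. The hard limit caps what
/// an unprivileged process can request, even if fs.file-max is far higher.
fn raise_fd_limit() -> io::Result<libc::rlim_t> {
    unsafe {
        let mut rl = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut rl) != 0 {
            return Err(io::Error::last_os_error());
        }
        // Bump the soft limit to the hard limit; raising the hard limit itself
        // needs /etc/security/limits.conf (or CAP_SYS_RESOURCE).
        rl.rlim_cur = rl.rlim_max;
        if libc::setrlimit(libc::RLIMIT_NOFILE, &rl) != 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(rl.rlim_cur)
    }
}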

erikreppel commented 1 year ago

Currently have

ubuntu@reth:~$ ulimit -Sn
1048576
ubuntu@reth:~$ cat /proc/sys/fs/file-max
9223372036854775807
ubuntu@reth:~$ ulimit -Hn
1048576
mattsse commented 1 year ago

That should be plenty; the 429 is only handled after the request has been received.

erikreppel commented 1 year ago

Yep. Any insight on the cause of the 429s? Resource utilization on the instance is still quite low.

mattsse commented 1 year ago

I need to check whether the limits are set correctly, but the 429 is only ever used here:

https://github.com/paritytech/jsonrpsee/blob/54f4dcdfee519b9c2d2b787be9c40c8e7fc67cff/server/src/server.rs#L775-L784
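
(Not the actual jsonrpsee code, just a rough sketch of the pattern behind that link: a semaphore-sized connection limit that answers 429 once every slot is held, where a slot is only given back when its connection is fully torn down.)

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore, TryAcquireError};

// Hypothetical connection guard: every accepted connection must hold a permit.
struct ConnectionGuard {
    permits: Arc<Semaphore>,
}

impl ConnectionGuard {
    fn new(max_connections: usize) -> Self {
        Self { permits: Arc::new(Semaphore::new(max_connections)) }
    }

    // If no permit is free, the server replies with 429 instead of serving the
    // connection. A permit is only released when it is dropped, i.e. when the
    // connection (and any in-flight request on it) is fully done, not when the
    // load generator stops waiting for the response.
    fn try_accept(&self) -> Result<OwnedSemaphorePermit, TryAcquireError> {
        self.permits.clone().try_acquire_owned()
    }
}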

How many 429s do you get if --rpc-max-connections matches the --rate?

mattmalec commented 1 year ago

Only a small % of requests go through if the max conns are set to the rate. In this case they were both 1000, and our duration was 30 seconds:


{
  "200": 3252,
  "429": 26747
}
mattmalec commented 1 year ago

If I set the max conns to 1 million, the requests never actually go through:

      "status_codes": [
        {
          "0": 29991,
          "200": 9
        }
      ],
      "errors": [
        [
          "Post \"http://localhost:8545\": dial tcp 0.0.0.0:0->[::1]:8545: connect: connection refused",
          "Post \"http://localhost:8545\": dial tcp 0.0.0.0:0->[::1]:8545: bind: address already in use",
          "Post \"http://localhost:8545\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
        ]
      ]
mattsse commented 1 year ago

Combined these are 30k, which matches 30 * 1000, i.e. 1000 per second,

so this is the same issue as mentioned above.

I'm not exactly sure whether the connection handler is able to drop the connection and cancel the request if the client never sends a FIN.

Need to look into this.
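
(A rough sketch of the question being raised, under the assumption that a handler could race each request against a connection-teardown signal; the names here are illustrative, not reth or jsonrpsee internals.)

use std::future::Future;

// If request handling is raced against a "connection closed" signal, an
// abandoned request frees its slot as soon as the client is gone; if it is
// not, the slot stays held until the request finishes on its own, which is
// what would keep the server saturated between flood runs.
async fn serve_request<R, C, T>(request: R, connection_closed: C) -> Option<T>
where
    R: Future<Output = T>,
    C: Future<Output = ()>,
{
    tokio::select! {
        result = request => Some(result),
        _ = connection_closed => None, // client went away: drop the in-flight work
    }
}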

mattsse commented 1 year ago

status code 0 is not sent from the server, this must be something else like

bind: address already in use

mattmalec commented 1 year ago

status code 0 is not sent from the server, this must be something else like

I'm curious where that would be coming from then; the only outlier is the max conns arg being 1000 vs 1 million.

mattmalec commented 1 year ago

Also an interesting find: I set the max conns to 20k and I see this:

      "status_codes": [
        {
          "0": 19991,
          "200": 7,
          "429": 10002
        }
      ],
      "errors": [
        [
          "429 Too Many Requests",
          "Post \"http://localhost:8545\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
        ]
      ],
mattsse commented 1 year ago

Okay, I ran some tests against an endpoint that intentionally exceeds the 30s timeout used in flood, at rate 10 for 30s:

            "status_codes": [
                {
                    "0": 300
                }
            ],
            "errors": [
                [
                    "Post \"http://localhost:8545\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
                ]
            ],

@sslivkoff looks like flood maps timeouts to status code 0

sslivkoff commented 1 year ago

HTTP status code 0 on the timeouts is just a filler value: no response is ever received, so there is no real status code to record.
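
(In other words, something like this hypothetical bit of report bookkeeping, written as a Rust sketch rather than flood's actual code:)

// When a request times out there is no HTTP response at all, so there is no
// real status code; 0 is used as a placeholder in the report.
fn recorded_status(response_status: Option<u16>) -> u16 {
    response_status.unwrap_or(0)
}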

mattmalec commented 1 year ago

What would you suggest for next steps? It seems like we're limited to 10k requests and all of those 429, then the rest time out, even with the limit set significantly higher.

bemevolent commented 1 year ago

Was running into the same thing: if the consensus client is killed first and flood is then run against reth, it can handle 10,000 one run after another, no problem. So my guess is the underlying issue is either something blocking on the auth RPC service, or resource contention with the massive number of try_insert_validated_block calls being done from consensus.

mattsse commented 1 year ago

I'll bump prio on perf improvements here

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.