Closed erikreppel closed 1 year ago
just to summarize, the issue is that you get 429 for rates lower than the configured --rpc-max-connections?
can you try this with a single --rates 1000
one issue here is that the server is not necessarily back at 0 open connections when the flood run for the next rate starts, because vegeta simply drops the connection once a run finishes, but the response for that dropped request may still be in flight at that moment
correct, that's the issue.
Running as a single run seems to have helped, but even at --rates 100 we're hitting "too many open files" errors.
Increasing the soft limit on open files seems to have helped, but now we're seeing the 429 issue again:
flood eth_getLogs RETH=http://127.0.0.1:8545 --rates 1000 --duration 30
...
"status_codes": [
{
"0": 4955,
"200": 29,
"429": 25016
}
],
"errors": [
[
"429 Too Many Requests",
"Post \"http://127.0.0.1:8545\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
]
],
...
while we raise the fd limit, this will pick the hard resource limit (ulimit -Hn) if it's lower than the actual kernel limit.
but you can increase this manually, I believe via /etc/security/limits.conf or sysctl
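For reference, a sketch of checking and raising these limits on a typical Linux setup (the values shown are illustrative, not taken from this thread):

```shell
# Check the current soft/hard open-file limits for this shell session
ulimit -Sn
ulimit -Hn

# To raise them (requires root), e.g.:
#   sysctl -w fs.nr_open=2097152     # per-process ceiling
#   sysctl -w fs.file-max=2097152    # system-wide total
# and persist per-user limits in /etc/security/limits.conf:
#   *  soft  nofile  1048576
#   *  hard  nofile  1048576
# then log out and back in for the new limits to apply
```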
Currently have
ubuntu@reth:~$ ulimit -Sn
1048576
ubuntu@reth:~$ cat /proc/sys/fs/file-max
9223372036854775807
ubuntu@reth:~$ ulimit -Hn
1048576
that should be plenty,
429 is only handled after the request has been received
yep. Any insight on the cause of the 429s? Resource utilization on the instance is still quite low
need to check if the limits are set correctly but the 429 is only ever used here
how many 429s do you get if --rpc-max-connections matches the --rates value?
only a small % of requests go through if the max conns are set to the rate - in this case, they were both 1000, and our duration was 30 seconds
{
"200": 3252,
"429": 26747
}
if I set the max conns to 1 million, the requests never actually go through
"status_codes": [
{
"0": 29991,
"200": 9
}
],
"errors": [
[
"Post \"http://localhost:8545\": dial tcp 0.0.0.0:0->[::1]:8545: connect: connection refused",
"Post \"http://localhost:8545\": dial tcp 0.0.0.0:0->[::1]:8545: bind: address already in use",
"Post \"http://localhost:8545\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
]
]
combined these are 30k, which matches 30 * 1000, or 1000 per second,
so this is the same issue as mentioned above.
I'm not exactly sure if the connection handler is able to drop the connection and cancel the request if the client never sends a FIN,
need to look into this
status code 0 is not sent from the server, this must be something else like
bind: address already in use
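The "bind: address already in use" error typically points at ephemeral-port exhaustion on the client side under this kind of load. A quick way to check (commands assume Linux; the sysctl values are illustrative suggestions, not settings used in this thread):

```shell
# Ephemeral port range available for outgoing client connections
cat /proc/sys/net/ipv4/ip_local_port_range

# Client sockets lingering in TIME_WAIT keep holding ports after
# vegeta drops connections; they can be counted with:
#   ss -tan state time-wait | wc -l
#
# To widen the range / recycle ports faster (requires root), e.g.:
#   sysctl -w net.ipv4.ip_local_port_range="1024 65535"
#   sysctl -w net.ipv4.tcp_tw_reuse=1
```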
I'm curious where that would be coming from then - the only outlier is the max conns arg being 1000 vs 1 million
also an interesting find: I set the max conns to 20k and I see this:
"status_codes": [
{
"0": 19991,
"200": 7,
"429": 10002
}
],
"errors": [
[
"429 Too Many Requests",
"Post \"http://localhost:8545\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
]
],
Okay, I ran some tests with an endpoint intentionally exceeding the 30s timeout used in flood, at rate 10, 30s
"status_codes": [
{
"0": 300
}
],
"errors": [
[
"Post \"http://localhost:8545\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
]
],
@sslivkoff looks like flood maps timeouts to status code 0
http status code 0 on the timeouts is just a filler because there is no status code due to no response ever being received
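This matches how other HTTP tooling behaves: when no response is ever received, there is no status code to record, and clients commonly fall back to 0. As an illustration, curl does the same with its %{http_code} write-out variable (the target port here is arbitrary; nothing is expected to be listening on it):

```shell
# curl prints 000 for %{http_code} when the connection fails or
# times out before any response headers arrive
curl -s -m 2 -o /dev/null -w "%{http_code}\n" http://127.0.0.1:9 || true
```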
what would you suggest for next steps? it seems like we're limited to 10k requests, all of which 429 and then time out, even with the limit set significantly higher
Was running into the same - if the consensus client is killed first and flood is then run against reth, it can handle 10,000 one run after another no problem. So my guess is the underlying issue is either something blocking on the auth RPC service, or resource contention from the massive number of try_insert_validated_block calls coming from consensus
I'll bump prio on perf improvements here
This issue is stale because it has been open for 14 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Describe the bug
I've noticed while doing some load testing against our reth node that we get 429 responses while resource utilization is still quite low, across a variety of different configurations
full test output:
Steps to reproduce
run reth with --rpc-max-connections as set in the above results, then run:
flood eth_getLogs RETH=http://reth:8545 --rates 10 100 250 500 750 1000 --duration 30
also observable on
flood eth_getBlockByNumber RETH=http://reth:8545 --rates 10 100 250 500 750 1000 --duration 30
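When comparing runs, it can help to aggregate the status-code counts from the saved output. A hedged sketch with jq, using a tiny sample file in the same shape as the snippets above (sample.json and its layout are assumptions for illustration, not documented flood output):

```shell
# Build a tiny sample in the same shape as the output shown above
cat > sample.json <<'EOF'
{"status_codes": [{"0": 4955, "200": 29, "429": 25016}]}
EOF

# Sum all responses across status codes; here this totals 30000,
# matching rate 1000 * duration 30
jq '[.status_codes[0][]] | add' sample.json
```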
Node logs
No response
Platform(s)
Linux (x86)
What version/commit are you on?
reth/v0.1.0-alpha.3-31af4d5/x86_64-unknown-linux-gnu
What database version are you on?
If you've built Reth from source, provide the full command you used
No response
Code of Conduct