RPC responses lag every 10 minutes

gr8den commented 3 days ago

Describe the bug

I have a monitoring service which queries the latest block from RPC every 1 second. The timeout for this query is 1 second. I noticed that the reth node sometimes cannot respond to requests within the 1 s timeout, and this occurs every 10 minutes. These lags last ~3 seconds on a reth archive Ethereum node. Is there some blocking interval task in Reth? Any ideas how to fix this?

Monitoring image. Peaks are timeouts. (The chart was originally built to monitor the lag between the last block time and the current time, so ignore the 60000 ms value; I record any timed-out request as a lag of 60000.)

I also noticed that these lags are visible on the Max transaction open time chart.

Times in both the logs and the screenshots are UTC. The node is fully synced. Freelist is ~30,000. A BSC reth archive node has the same problem, but the lag is ~6 seconds instead of ~3 seconds, with the same 10-minute interval.

It is launched in Docker (version 27.3.1, build ce12230) on Linux (Ubuntu), self-hosted.

Specs of SSD: https://www.adata.com/en/consumer/category/ssds/solid-state-drives-legend-960/?tab=specification (other SSDs are used too, but problem is same)

Steps to reproduce

Just run a reth archive node:

      node
      --chain mainnet
      --metrics 0.0.0.0:9001
      --log.file.directory /root/logs
      --authrpc.addr 0.0.0.0
      --authrpc.port 8551
      --authrpc.jwtsecret /root/jwt/jwt.hex
      --http --http.addr 0.0.0.0 --http.port 8545
      --http.api "eth,net,web3,debug"
      --http.corsdomain "*"
      --rpc-max-logs-per-response 1000000

Node logs

reth-debug-logs-2024-11-29T06to07.txt

Platform(s)

Linux (x86)

What version/commit are you on?

    reth Version: 1.1.0-dev
    Commit SHA: 1ba631ba9581973e7c6cadeea92cfe1802aceb4a
    Build Timestamp: 2024-11-18T05:56:49.513902613Z
    Build Features: jemalloc
    Build Profile: release

What database version are you on?

    Current database version: 2
    Local database version: 2

Which chain / network are you on?

mainnet

What type of node are you running?

Archive (default)

What prune config do you use, if any?

No response

If you've built Reth from source, provide the full command you used

No response

Code of Conduct

[x] I agree to follow the Code of Conduct

mattsse commented 3 days ago

query latest block from rpc every 1 second

query how? eth_getBlock? latest or by hash/number?

gr8den commented 3 days ago

query latest block from rpc every 1 second

query how? eth_getBlock? latest or by hash/number?

{"jsonrpc":"2.0","method": "eth_getBlockByNumber", "params": ["latest", false],"id":74}

mattsse commented 3 days ago

hmm, interesting that this appears to be the same 6s delay

could you perhaps check your rpc panel, ref:

https://reth.paradigm.xyz/d/2k8BXz24x/reth?orgId=1&refresh=30s&viewPanel=120&from=now-24h&to=now

and check if you can spot the same 6s spikes there?

EDIT:

I mark any timeout requests as 60000 lag

I see, so they just time out indefinitely?

gr8den commented 3 days ago

and check if you can spot the same 6s spikes there?

yes, i see some of them: screenshot

I see, so they just time out indefinitely?

I need to change my code to record the actual response time (or the absence of any response). At the moment it returns an error if there is no response within 1 second.

Monitoring code looks like:

result = get_last_block_or_return_error_after_1s_timeout()
if result is error:
  return 60000
else:
  return min(60000, now() - result.blocktime_in_ms)
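
Expressed in TypeScript, the logic above can be sketched roughly as follows. The names `lagMs`, `BlockResult`, and `TIMEOUT_LAG_MS` are made up for illustration and are not taken from the actual monitoring service. Note that the returned value is the age of the latest block, capped at 60000, so a failed request and a block older than 60 seconds look the same on the chart.

```typescript
// Hypothetical names; this only mirrors the pseudocode above.
const TIMEOUT_LAG_MS = 60_000;

// Either a block with its timestamp in milliseconds, or an error (e.g. a timeout).
type BlockResult = { blocktimeInMs: number } | Error;

function lagMs(result: BlockResult, nowMs: number): number {
  if (result instanceof Error) {
    // Any failed request is recorded as the maximum lag.
    return TIMEOUT_LAG_MS;
  }
  // Age of the latest block, capped at the same maximum.
  return Math.min(TIMEOUT_LAG_MS, nowMs - result.blocktimeInMs);
}
```

For example, `lagMs({ blocktimeInMs: 1_000 }, 4_000)` gives `3000`, while any `Error` gives `60000` regardless of how long the request actually took.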

mattsse commented 3 days ago

ty,

could you please also post get_last_block_or_return_error_after_1s_timeout

gr8den commented 3 days ago

ty,

could you please also post get_last_block_or_return_error_after_1s_timeout

TypeScript:

    // Fetches the latest block; rejects if there is no response within 1 second.
    async function getLastBlockOrReturnErrorAfter1sTimeout() {
      const res = await fetch('http://reth:8545', {
        method: 'POST',
        headers: {
          'Accept': 'application/json',
          'Content-Type': 'application/json'
        },
        body: '{"jsonrpc":"2.0","method": "eth_getBlockByNumber", "params": ["latest", false],"id":74}',
        keepalive: true,
        // Abort the request after 1000 ms.
        signal: AbortSignal.timeout(1000),
      });
      const data = await res.json() as { result: { timestamp: string, number: string } };
      return data;
    }
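
To tell a 1-second timeout apart from other failures, the value that `fetch` rejects with can be inspected. This is a minimal sketch, assuming the runtime follows the standard behaviour where `AbortSignal.timeout()` aborts with a `DOMException` named `TimeoutError`. `classifyFailure` and `FailureKind` are hypothetical names, not part of the monitoring code above.

```typescript
type FailureKind = "timeout" | "aborted" | "other";

// Classify a value caught from a rejected fetch() call by its `name` property.
function classifyFailure(err: unknown): FailureKind {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? (err as { name: unknown }).name
      : undefined;
  if (name === "TimeoutError") return "timeout"; // expected from AbortSignal.timeout()
  if (name === "AbortError") return "aborted"; // expected from a manual AbortController.abort()
  return "other"; // network errors, JSON parse errors, and so on
}
```

Calling this in a `catch` block around the request would let the monitor record timeouts separately instead of folding every error into the same value.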

Also, I have a BSC geth full node which is monitored in the same way. It doesn't have any problems.

mattsse commented 5 hours ago

ah okay, so this would actually return 60000 ms if the rpc response is an error?

mattsse commented 4 hours ago

ah I think you're encountering a reorg bug

mattsse commented 4 hours ago

BSC reth archive node have same problem but lags ~6 seconds instead of ~3 seconds with same 10 minute interval

can you elaborate on where you got those numbers from?

you're also not tracking rpc duration it seems: (now() - result.blocktime_in_ms)
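
A minimal sketch of tracking RPC duration separately from block age, for illustration only. The helper `timed` and the type `Timed` are hypothetical names; the helper wraps any async call and reports how long it took, whether it succeeded or failed. It uses `Date.now()`, so the resolution is one millisecond.

```typescript
// Outcome of a call together with its wall-clock duration in milliseconds.
type Timed<T> =
  | { ok: true; value: T; durationMs: number }
  | { ok: false; error: unknown; durationMs: number };

// Run an async call and measure how long it takes to settle.
async function timed<T>(call: () => Promise<T>): Promise<Timed<T>> {
  const start = Date.now();
  try {
    const value = await call();
    return { ok: true, value, durationMs: Date.now() - start };
  } catch (error) {
    // The duration is still recorded when the call rejects (e.g. on a timeout).
    return { ok: false, error, durationMs: Date.now() - start };
  }
}
```

Wrapping the existing request in `timed` would give the round-trip time of each call, which could be charted next to the `now() - result.blocktime_in_ms` value rather than being inferred from it.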
