Load Balanced RPC redirecting to out of sync nodes

skalenetwork / skale-proxy

SKALE Proxy is a high-performance, easy-to-run public service that provides proxied and load-balanced JSON-RPC endpoints for SKALE chains. It is based on NGINX.

GNU Affero General Public License v3.0


Load Balanced RPC redirecting to out of sync nodes #70

Open yohanelly95 opened 2 months ago

yohanelly95 commented 2 months ago

Validators using the load-balanced RPC were sometimes redirected to a faulty/out-of-sync RPC, which consistently returned a stale block number, leading to validator downtime. Restarting the validator node did not fix it. The out-of-sync RPC was found to be https://skale-node2.01node.com:10136/, but it could change over time.

load-balanced RPC: https://mainnet.skalenodes.com/v1/turbulent-unique-scheat
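One way to confirm that a particular endpoint is lagging is to compare the block height it reports with the height reported by another endpoint. The following is a minimal sketch, not code from skale-proxy: the helper names `rpc_call`, `block_number`, and `blocks_behind` are hypothetical, and it assumes the endpoints answer the standard `eth_blockNumber` JSON-RPC method with a hex string.

```python
import json
import urllib.request


def rpc_call(url, method, params=None, timeout=10):
    """POST a JSON-RPC 2.0 request to `url` and return its `result` field."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    ).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["result"]


def block_number(url):
    """Return the latest block height reported by `url` as an int."""
    # eth_blockNumber returns a hex-encoded quantity such as "0x1b4".
    return int(rpc_call(url, "eth_blockNumber"), 16)


def blocks_behind(reference_height, candidate_height):
    """How many blocks the candidate trails the reference (never negative)."""
    return max(0, reference_height - candidate_height)
```

For example, `blocks_behind(block_number(url_a), block_number(url_b))` reports how far `url_b` trails `url_a` at that moment. A single comparison can be off by a block or two because the two requests are not simultaneous, so a persistent gap is the more meaningful signal.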

dmytrotkk commented 2 months ago

Thanks for opening this issue, @yohanelly95. We’ll look into it.

For context: we cannot check if a node is synced on each call, as it would add significant overhead to the Nginx proxy. Instead, we check block timestamps every three hours and remove out-of-sync endpoints from the rotation.

Here’s how it works: we check the block timestamps on all endpoints, identify the highest one, and compare it to the others. The maximum allowed slippage is 300 seconds (5 minutes). Given an average block time of 10.5 seconds (for the Razor chain), an endpoint can fall roughly 28 blocks behind (300 / 10.5 ≈ 28.6) before it is treated as out of sync.

https://github.com/skalenetwork/skale-proxy/blob/41b1e887cbb573468c3915287b962290cbf40661/proxy/endpoints.py#L69

ALLOWED_TIMESTAMP_DIFF = 300
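The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `proxy/endpoints.py`: the function names `split_by_sync` and `max_lag_blocks` are hypothetical, the input is assumed to be a mapping from endpoint URL to the latest block timestamp in seconds, and treating a difference of exactly 300 seconds as still in sync is a guess.

```python
ALLOWED_TIMESTAMP_DIFF = 300  # seconds, the value quoted in this thread


def split_by_sync(timestamps, allowed_diff=ALLOWED_TIMESTAMP_DIFF):
    """Split endpoints into (synced, stale) lists.

    `timestamps` maps endpoint URL -> latest block timestamp in seconds.
    An endpoint is stale when it trails the newest timestamp by more
    than `allowed_diff` seconds (boundary handling is an assumption).
    """
    if not timestamps:
        return [], []
    newest = max(timestamps.values())
    synced = [url for url, ts in timestamps.items() if newest - ts <= allowed_diff]
    stale = [url for url, ts in timestamps.items() if newest - ts > allowed_diff]
    return synced, stale


def max_lag_blocks(allowed_diff, avg_block_time):
    """Whole blocks an endpoint can trail by while staying within the limit."""
    return int(allowed_diff // avg_block_time)
```

With the numbers from this thread, `max_lag_blocks(300, 10.5)` gives 28, which matches the estimate above. Because the check runs only every three hours, an endpoint that falls behind between runs could stay in rotation until the next one.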

We will come back to you after additional checks on our side.

yohanelly95 commented 3 weeks ago

Hey @dmytrotkk! Are there any updates on how we can handle this?