paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

cumulus/minimal-node: added prometheus metrics for the RPC client #5572

Closed iulianbarbu closed 1 week ago

iulianbarbu commented 3 weeks ago

Description

When we start a node with connections to external RPC servers (as a minimal node), we lack metrics on how many individual calls we make to the remote RPC servers and how long they take. This PR adds metrics that measure the duration of each RPC call made by the minimal node and, implicitly, how many calls there are.

Closes #5409 Closes #5689

Integration

Node operators should be able to track minimal node metrics and decide on appropriate actions based on how they interpret them. The added metrics can be observed by curl'ing the prometheus metrics endpoint of the parachain (changed from the relaychain based on the review). The metrics live in the relay_chain_rpc_interface namespace (renamed from polkadot_parachain_relay_chain_rpc_interface; I realized lining up parachain_relay_chain in the same metric might be confusing :). Excerpt from the curl:

relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.001"} 15
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.004"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.016"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.064"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.256"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="1.024"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="4.096"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="16.384"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="65.536"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="+Inf"} 24
relay_chain_rpc_interface_sum{method="chain_getBlockHash",chain="rococo_local_testnet"} 0.11719075
relay_chain_rpc_interface_count{method="chain_getBlockHash",chain="rococo_local_testnet"} 24
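
To read the excerpt above: the `_bucket` values are cumulative, so the count in each latency band is the difference between neighbouring buckets, and the mean duration is `_sum / _count`. Below is a minimal sketch of that arithmetic, using only the standard library. The helper names `per_bucket_counts` and `mean_duration` are made up for illustration and are not part of the PR, and it assumes the durations are in seconds, as is conventional for Prometheus.

```rust
/// Converts cumulative Prometheus bucket counts (ordered by increasing `le`)
/// into the number of observations that landed in each individual bucket.
fn per_bucket_counts(cumulative: &[u64]) -> Vec<u64> {
    let mut previous: u64 = 0;
    cumulative
        .iter()
        .map(|&current| {
            // Cumulative counts never decrease; saturating_sub guards bad input.
            let delta = current.saturating_sub(previous);
            previous = current;
            delta
        })
        .collect()
}

/// Mean observed duration (`_sum / _count`), or `None` if nothing was observed.
fn mean_duration(sum_seconds: f64, count: u64) -> Option<f64> {
    if count == 0 {
        None
    } else {
        Some(sum_seconds / count as f64)
    }
}

fn main() {
    // Values copied from the chain_getBlockHash excerpt, le="0.001" ..= le="+Inf".
    let cumulative = [15, 23, 23, 23, 24, 24, 24, 24, 24, 24];
    println!("per bucket: {:?}", per_bucket_counts(&cumulative));

    if let Some(mean) = mean_duration(0.11719075, 24) {
        println!("mean duration: {:.6} s", mean);
    }
}
```

For this excerpt, 15 calls finished within 0.001, 8 more within 0.004, and one slower call landed in the 0.064 to 0.256 band. The mean works out to roughly 0.0049.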

Review Notes

The way we measure durations/hits is based on the HistogramVec struct, which allows us to collect timings for each RPC client method called from the minimal node. It can be extended to measure the RPCs against other dimensions too (status codes, response sizes, etc.). The timing is measured at the level of the relay-chain-rpc-interface, in the RelayChainRpcClient struct's method 'request_tracing', which is a single entry point for all RPC requests done through the relay-chain-rpc-interface. The request durations fall into exponential buckets described by start 0.001, factor 4 and count 9.
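
As a rough illustration of the bucket layout and the single-entry-point timing described above, here is a standard-library-only sketch. It is not the code from this PR and does not use the prometheus crate. `MethodHistogram`, `Series`, `time`, and the local `exponential_buckets` helper are hypothetical stand-ins for a per-method histogram, and durations are assumed to be in seconds.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Upper bounds of `count` exponential buckets: start, start*factor, start*factor^2, ...
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    let mut bounds = Vec::with_capacity(count);
    let mut bound = start;
    for _ in 0..count {
        bounds.push(bound);
        bound *= factor;
    }
    bounds
}

/// Per-method observations: one slot per bound plus a trailing +Inf slot.
#[derive(Debug)]
struct Series {
    bucket_counts: Vec<u64>,
    sum: f64,
    count: u64,
}

/// Hypothetical stand-in for a histogram labelled by RPC method name.
struct MethodHistogram {
    bounds: Vec<f64>,
    series: HashMap<String, Series>,
}

impl MethodHistogram {
    fn new(bounds: Vec<f64>) -> Self {
        Self { bounds, series: HashMap::new() }
    }

    /// Records one duration (assumed seconds) under the given method label.
    fn observe(&mut self, method: &str, seconds: f64) {
        let slots = self.bounds.len();
        let series = self.series.entry(method.to_string()).or_insert_with(|| Series {
            bucket_counts: vec![0; slots + 1],
            sum: 0.0,
            count: 0,
        });
        // First bucket whose upper bound covers the value, else the +Inf slot.
        let idx = self.bounds.iter().position(|&b| seconds <= b).unwrap_or(slots);
        series.bucket_counts[idx] += 1;
        series.sum += seconds;
        series.count += 1;
    }

    /// Single entry point: runs `call`, records how long it took, returns its result.
    fn time<T>(&mut self, method: &str, call: impl FnOnce() -> T) -> T {
        let started = Instant::now();
        let result = call();
        self.observe(method, started.elapsed().as_secs_f64());
        result
    }
}

fn main() {
    // start 0.001, factor 4, count 9, as stated in the review notes.
    let bounds = exponential_buckets(0.001, 4.0, 9);
    println!("bucket bounds: {:?}", bounds);

    let mut metrics = MethodHistogram::new(bounds);
    let value = metrics.time("chain_getBlockHash", || 42);
    println!("call returned {value}");
    println!("{:?}", metrics.series.get("chain_getBlockHash"));
}
```

With start 0.001, factor 4 and count 9, the upper bounds run from 0.001 to 65.536, which lines up with the `le` labels in the curl excerpt. Routing every request through one timing wrapper keeps the measurement in one place, so adding a label later means changing only that wrapper.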

paritytech-cicd-pr commented 3 weeks ago

The CI pipeline was cancelled due to the failure of one of the required jobs. Job name: test-linux-stable-int Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7290328