**jancionear** opened this issue 1 week ago
I suggest not only adding a `shard_id` label to the metric, but also renaming the metric itself. Then we don't need to worry about old code seeing multiple `near_transaction_processed_total` metrics, picking a random one and rolling with it.
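The "picking a random one" failure mode can be sketched with a short, hypothetical example (the exposition values and the parsing code below are illustrative, not taken from nearcore): a pre-label reader that greps the `/metrics` output for the metric name and takes the first match would silently return a single shard's count once the label is added, while a label-aware reader has to sum every series.

```python
# Hypothetical /metrics exposition text after a shard_id label is added.
exposition = """\
near_transaction_processed_total{shard_id="0"} 120
near_transaction_processed_total{shard_id="1"} 95
near_transaction_processed_total{shard_id="2"} 110
"""

def old_reader(text: str) -> float:
    # Naive pre-label parsing: take the first line that starts with the
    # metric name and read its value -- effectively picking one shard.
    for line in text.splitlines():
        if line.startswith("near_transaction_processed_total"):
            return float(line.split()[-1])
    raise ValueError("metric not found")

def label_aware_reader(text: str) -> float:
    # Label-aware parsing: sum the values of every labeled series.
    return sum(
        float(line.split()[-1])
        for line in text.splitlines()
        if line.startswith("near_transaction_processed_total")
    )

print(old_reader(exposition))          # only shard 0: 120.0
print(label_aware_reader(exposition))  # all shards: 325.0
```

Renaming the metric would turn this silent misread into a hard failure for old readers, which is the point of the suggestion above.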
The `near_transaction_processed_total` metric describes the number of transactions processed on some node. It's currently used to calculate the total network TPS (transactions per second) by averaging this metric over all nodes.

When all nodes track all shards and process all transactions this works fine, but soon we will move to stateless validation and single shard tracking, where a node doesn't track all shards and doesn't process all the transactions. This means that the calculated TPS metric will be lower than reality: each node might process only `1/num_shards` of the transactions, and then an averaged aggregation will give us only `1/num_shards * total_tps`.
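To make the undercount concrete, here is a small numeric sketch (the numbers are hypothetical: 4 shards, each processing 100 TPS, two single-shard-tracking nodes per shard). It approximates in plain Python what the old and the proposed aggregation queries would compute:

```python
NUM_SHARDS = 4
SHARD_TPS = 100.0  # hypothetical: each shard processes 100 tx/s

# Two single-shard-tracking nodes per shard; a node's counter rate only
# reflects the one shard it tracks.
samples = [(shard, SHARD_TPS) for shard in range(NUM_SHARDS) for _ in range(2)]

real_total_tps = NUM_SHARDS * SHARD_TPS  # 400.0

# Current query: average over all nodes -> 1/num_shards * total_tps.
naive = sum(tps for _, tps in samples) / len(samples)

# Proposed query: average within each shard, then sum the shard averages.
by_shard: dict[int, list[float]] = {}
for shard, tps in samples:
    by_shard.setdefault(shard, []).append(tps)
fixed = sum(sum(rates) / len(rates) for rates in by_shard.values())

print(naive)  # 100.0 == real_total_tps / NUM_SHARDS
print(fixed)  # 400.0 == real_total_tps
```

The per-shard average handles redundancy (several nodes reporting the same shard) the same way the current all-nodes average does, while the outer sum restores the shards that any single node doesn't see.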
We could fix it by adding a `shard_id` label. Then we could have a query which averages `near_transaction_processed_total` within each shard and adds the per-shard averages together to get the total TPS in an accurate way.

The same should be done for other related metrics (
`near_transaction_processed_successfully_total`, ...).

The only problem with adding a
`shard_id` label is that it'd break backwards compatibility. `near_transaction_processed_total` is a fundamental metric which might be read by other code to gather statistics, and adding a `shard_id` could break that code. For example: https://github.com/near/nearcore/blob/48201f8da135bdd23a04eed51905b5cc3f173292/scripts/ft-benchmark-data-sender.py#L23

We could add a new metric with
`shard_id` to maintain backwards compatibility. On the other hand, any code that uses this metric to calculate TPS will be wrong soon anyway, so maybe it's better for that code to fail and force the author to use the new metric properly.

Refs: zulip conversation