near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0
2.3k stars 600 forks source link

Add shard_id to the near_transaction_processed_total metric #11682

Open jancionear opened 1 week ago

jancionear commented 1 week ago

The near_transaction_processed_total describes the number of transactions processed on some node:

pub static TRANSACTION_PROCESSED_TOTAL: Lazy<IntCounter> = Lazy::new(|| {
    try_create_int_counter(
        "near_transaction_processed_total",
        "The number of transactions processed since starting this node",
    )
    .unwrap()
});

It's currently used to calculate the total network TPS (transactions per second) by averaging this metric over all nodes.

When all nodes track all shards and process all transactions this works fine, but soon we will move to stateless validation and single shard tracking, where a node doesn't track all shards and doesn't process all the transactions. This means that the calculated TPS metric will be lower than reality - each node might process only 1/num_shards of transactions from some shards, and then an averaged aggregation will give us only 1/num_shards * total_tps.

We could fix it by adding a shard_id label. Then we could have a query which averages near_transaction_processed_total on each shard and adds them together to get the total TPS in an accurate way.

The same should be done for other related metrics (near_transaction_processed_successfully_total, ...)

The only problem with adding a shard_id is that it'd break backwards compatibility. near_transaction_processed_total is a fundamental metric which might be read from other code to gather statistics. Adding a shard_id could break this code. for example: https://github.com/near/nearcore/blob/48201f8da135bdd23a04eed51905b5cc3f173292/scripts/ft-benchmark-data-sender.py#L23

We could add a new metric with shard_id to maintain backwards compatibility. OTOH any code that used this metric to calculate TPS will be wrong soon, so maybe it's better for other code to fail and force the author to use the new metric properly.

Refs: zulip conversation

nagisa commented 4 days ago

I suggest not only to add a shard_id to the metric, but also rename the metric itself. Then we don't need to worry about old code seeing multiple near_transaction_processed_total metrics, picking a random one and rolling with it.