sig-net / mpc


Our node does not perform as well as other nodes #1

Open volovyks opened 1 week ago

volovyks commented 1 week ago

Description

[dashboard screenshot]

This behavior was observed on both Testnet and Mainnet. It can lead to failures in all protocols.

volovyks commented 1 week ago

P.S. This is a modified dashboard; I will add it soon.

auto-mausx commented 1 week ago

So I did notice this started when we moved our node over. Our node has technically been running for a shorter timeframe than the others, since we destroyed it and rebuilt it; I attributed the behavior to that at first, so perhaps it is just the way the metric is exported.

Just for clarity's sake, this node has the exact same machine size, disk size, and networking configuration as the rest of the partner nodes. I mirrored the environment from Pagoda 1-for-1 specifically to avoid any issues.

auto-mausx commented 1 week ago

Here's my theory:

This is the code that increments that metric:

crate::metrics::PROTOCOL_ITER_CNT
    .with_label_values(&[my_account_id.as_str()])
    .inc();

I hypothesize that Grafana calculates the rate per hour (increase()) by dividing the total count by 60 minutes. Since our node is "newer" than the other nodes, there is a significant difference between their total number of iterations and ours: the other nodes have months of iterations behind them, while we only have about 27 days' worth.

That would also explain why the other nodes are not exactly aligned with each other, since it took about a week for all of our partners to update.
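
To make the theory concrete, here is roughly what it would mean as a query. The metric name protocol_iter_cnt is a guess on my part; I have not looked at the actual panel definition.

    # The theory, written out: if the hourly figure were derived from the
    # lifetime total of the counter divided by the window, a node rebuilt
    # ~27 days ago would always plot lower than nodes with months of
    # accumulated iterations, regardless of its current throughput.
    protocol_iter_cnt / 60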

volovyks commented 1 week ago

Let's see how it behaves after the release. I hope increase() means how many new iterations happened in the last hour.

auto-mausx commented 1 week ago

That is what the docs say it means, so maybe we do have a real issue. I am not sure what it may be, though.

https://prometheus.io/docs/prometheus/latest/querying/functions/#increase
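
For reference, here is a minimal sketch of what those docs describe, using an assumed metric name (protocol_iter_cnt) and label (node_account_id), not the real panel query:

    # increase() only looks at samples inside the range window and returns how
    # much the counter grew during that window, extrapolated to the window
    # boundaries and adjusted for counter resets. History outside the window,
    # i.e. how long a node has been running, does not factor in.
    increase(protocol_iter_cnt[1h])

    # A per-node "iterations per hour" panel would then typically be:
    sum by (node_account_id) (increase(protocol_iter_cnt[1h]))

If the panel really is built on increase() like this, our node being newer should not by itself make it plot lower; it would only plot lower if it is actually completing fewer iterations per hour.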