Remember to match this sampling rate in Grafana's datasource config (the "Scrape interval" field). It's important for some of Grafana's global variables, like $__rate_interval, to work correctly.
I guess the argument for higher granularity would be peer / mem tracking, but then it would still make sense to use multiples and divisors of 12.
actually - 6s might be better - @jakubgs is this doable?
I'm sure it's doable, but considering how unresponsive and barely usable the beacon node dashboard already is, adding more data points will make it even worse. But sure, let's do it.
The metrics are collected due to this config: https://github.com/status-im/infra-hq/blob/1ce12670/ansible/roles/prometheus-slave/templates/prometheus.yml.j2#L93-L99
It should be as simple as adding a scrape_interval to that job definition:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
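For illustration, a per-job scrape_interval override looks roughly like this in prometheus.yml; the job here matches the one in the template linked above, but the target and port are placeholders, not the actual template contents:

```yaml
scrape_configs:
  - job_name: 'beacon-node-metrics'
    scrape_interval: 6s                # per-job override of the global default
    static_configs:
      - targets: ['beacon-node-host.example:8008']   # placeholder target
```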
Deployed in: https://github.com/status-im/infra-hq/commit/20e0e333
Seems to be working fine, at least on the scraping side:
One host is not responding because I rebuilt libp2p locally without -d:insecure.
I'll leave adjusting the dashboard to @stefantalpalaru since he seems to be managing that.
It's not the dashboard that needs adjusting, but the data source in the Grafana configuration, and I don't have access to that for metrics.status.im.
Okay, then done:
Looks like this caused our Prometheus + Cortex setup to choke:
I'm reverting the change for now (https://github.com/status-im/infra-hq/commit/1ba8be54) until I can figure out what caused this and tune some settings in https://github.com/status-im/infra-hq/issues/32.
I did some tuning of Prometheus in https://github.com/status-im/infra-hq/commit/61f523c1, so I'm trying to lower the scrape interval from the default 30 seconds to 15. We'll see whether that kills it again; if not, we can try going lower.
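(The commit itself isn't quoted here; for context, absorbing a higher sample rate on the remote-write path to Cortex typically means adjusting the remote_write queue_config, along these lines. The endpoint and values below are illustrative assumptions, not the actual settings from that commit.)

```yaml
remote_write:
  - url: http://cortex.example/api/prom/push   # placeholder Cortex push endpoint
    queue_config:
      capacity: 10000              # samples buffered per shard before blocking
      max_shards: 30               # upper bound on parallel senders
      max_samples_per_send: 1000   # batch size per push request
      batch_send_deadline: 5s      # flush a partial batch after this long
```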
Don't forget to change the data source config in Grafana.
There were some dropped samples but it looks fine now:
And a small spike in pending samples, but it returned to normal:
The number of samples Cortex gets went up, but the latency looks fine.
Most importantly flush queue looks good:
I'll let it run for the night and try a lower scrape interval tomorrow.
@stefantalpalaru I don't know if it makes sense to change the scrape interval in the config for the whole data source when we've only changed it for one specific service. I'm not really sure what it affects in terms of graphs, or how it would impact the jobs that are still using a 30s scrape interval.
This is why it's important for Grafana to know its Prometheus source's sampling rate: https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/
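In a provisioned data source, that "Scrape interval" field is jsonData.timeInterval; a minimal sketch (name and URL are placeholders):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus                       # placeholder name
    type: prometheus
    access: proxy
    url: http://prometheus.example:9090    # placeholder URL
    jsonData:
      timeInterval: "12s"                  # should match the job's scrape_interval
```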
I see.
I went down to a 12s interval, looking fine so far:
I tried going from a 12-second to a 6-second scrape interval again and it just failed, so I reverted it:
Prometheus was getting 400s and 500s from Cortex and failing to push samples:
Seems like master-01 was the only one mostly returning 400s and 500s later:
When I applied the change I started getting these errors in Cortex:
sample timestamp out of order; last timestamp: 1614092647.141, incoming timestamp: 1614092623.141 for series {__name__=\"process_start_time_seconds\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-02.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.141, incoming timestamp: 1614092623.141 for series {__name__=\"process_resident_memory_bytes\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-02.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.141, incoming timestamp: 1614092623.141 for series {__name__=\"process_virtual_memory_bytes\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-02.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.141, incoming timestamp: 1614092623.141 for series {__name__=\"process_cpu_seconds_total\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-02.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.141, incoming timestamp: 1614092623.141 for series {__name__=\"attached_validator_balance\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-02.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"af90d598\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.141, incoming timestamp: 1614092623.141 for series {__name__=\"attached_validator_balance_created\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-02.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"ad61fd55\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.141, incoming timestamp: 1614092623.141 for series {__name__=\"attached_validator_balance\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-02.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"ad157c93\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.141, incoming timestamp: 1614092623.141 for series {__name__=\"attached_validator_balance_created\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-02.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"b0c4ac47\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092641.171, incoming timestamp: 1614092623.16 for series {__name__=\"attached_validator_balance\", container=\"beacon-node-pyrmont-stable-small\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"stable-small-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"8b38db0f\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092641.171, incoming timestamp: 1614092623.16 for series {__name__=\"attached_validator_balance_created\", container=\"beacon-node-pyrmont-stable-small\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"stable-small-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"88018fba\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092641.171, incoming timestamp: 1614092623.16 for series {__name__=\"attached_validator_balance\", container=\"beacon-node-pyrmont-stable-small\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"stable-small-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"872007a5\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092641.171, incoming timestamp: 1614092623.16 for series {__name__=\"attached_validator_balance_created\", container=\"beacon-node-pyrmont-stable-small\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"stable-small-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"86b03bce\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.128, incoming timestamp: 1614092623.128 for series {__name__=\"attached_validator_balance\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"a920ae83\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.128, incoming timestamp: 1614092623.128 for series {__name__=\"attached_validator_balance_created\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"a43b0c0a\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.128, incoming timestamp: 1614092623.128 for series {__name__=\"attached_validator_balance\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"a5c7bf75\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.128, incoming timestamp: 1614092623.128 for series {__name__=\"attached_validator_balance_created\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"a63c37f8\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.128, incoming timestamp: 1614092623.128 for series {__name__=\"attached_validator_balance\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"aa797ad5\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.128, incoming timestamp: 1614092623.128 for series {__name__=\"attached_validator_balance_created\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"a71a4516\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.128, incoming timestamp: 1614092623.128 for series {__name__=\"attached_validator_balance\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"ac4d0fe5\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
sample timestamp out of order; last timestamp: 1614092647.128, incoming timestamp: 1614092623.128 for series {__name__=\"attached_validator_balance_created\", container=\"beacon-node-pyrmont-unstable-large\", datacenter=\"aws-eu-central-1a\", fleet=\"nimbus.pyrmont\", group=\",nimbus.pyrmont,beacon,nimbus,metrics,\", instance=\"unstable-large-01.aws-eu-central-1a.nimbus.pyrmont\", job=\"beacon-node-metrics\", pubkey=\"a3e29b63\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
Which suggests to me that 6 seconds is so frequent that the second scrape doesn't actually scrape any new metrics.
It seems like the main culprits are attached_validator_balance and attached_validator_balance_created:
admin@master-01.do-ams3.metrics.hq:~ % sudo journalctl --since '1 hour ago' -a -u cortex | awk '/nimbus.pyrmont/{match($0, /__name__=\\"([^"]+)\\"/, r); a[r[1]]++}END{for(x in a){printf "%5d - %s\n", a[x], x}}' | sort -h | tail
7 - netdata_services_throttle_io_read_KiB_persec_average
7 - nim_gc_mem_bytes
7 - process_start_time_seconds
8 - libp2p_pubsub_received_ihave_created
8 - netdata_apps_mem_MiB_average
8 - process_resident_memory_bytes
11 - process_virtual_memory_bytes
15 - process_cpu_seconds_total
688 - attached_validator_balance_created
690 - attached_validator_balance
Which would suggest to me that the code that returns values for validator balances doesn't return new values frequently enough, or something like that.
Well, whatever. For now I made it 12 seconds (https://github.com/status-im/infra-hq/commit/001841f6) and I'm gonna leave it at that.
cc @jakubgs what's the current setting, and does it match the Grafana configuration? If yes, we could close this, I believe?
The current setting is 12 seconds:
{{ job('beacon-node-metrics', interval='12s') }}
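Presumably that macro renders to a per-job override along these lines; the expansion below is an assumption, not the template's actual output:

```yaml
- job_name: 'beacon-node-metrics'
  scrape_interval: 12s
  # ...target discovery and relabelling from the template omitted
```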
And it is changed on metrics.status.im, but not on grafana.status.im, because that's mostly not using 12s intervals:
In eth2, things happen at a 12s cadence, so it makes sense that monitoring tools poll once every 12s, instead of the default 15s - this should make the graphs line up better with fewer spikes / irregularities that are caused by lack of timing sympathy.