Closed jakubgs closed 10 months ago
This might have to do with the fact that we use the same routing_table code for discv5 and each different Portal sub-network.
Aside from the weird double metric, the metric itself will also be incorrect as it will hold the data for all networks together.
We should find a way to split this routing_table_nodes metric per network / routing table instance.
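One way to do that split is to keep a single metric family and distinguish the routing table instances with a label. The sketch below is illustrative only (plain Python, hand-rolled exposition output; the class and label names are hypothetical and not the nim-eth/fluffy API):

```python
# Minimal sketch: one `routing_table_nodes` family, split per routing table
# instance with a `network` label instead of per-instance collectors.

class LabeledGauge:
    """One metric family; each routing table instance gets its own label value."""

    def __init__(self, name, help_text, label):
        self.name = name
        self.help = help_text
        self.label = label
        self.values = {}  # label value -> gauge value

    def set(self, label_value, value):
        self.values[label_value] = value

    def expose(self):
        # Prometheus exposition: exactly one HELP/TYPE pair per family,
        # then one sample per label value.
        lines = [f"# HELP {self.name} {self.help}",
                 f"# TYPE {self.name} gauge"]
        for lv, v in sorted(self.values.items()):
            lines.append(f'{self.name}{{{self.label}="{lv}"}} {v}')
        return "\n".join(lines)

# Hypothetical usage: discv5 and a Portal sub-network share one family.
gauge = LabeledGauge("routing_table_nodes", "Discovery routing table nodes", "network")
gauge.set("discv5", 43)
gauge.set("portal_history", 12)
print(gauge.expose())
```

With this shape the scrape output stays a single valid family, and per-network values remain separable in queries via the label.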
Hah, I forgot about this. It would be nice to fix this, but for now I'll just drop this metric:
Actually, it appears this issue also exists in Nim-Waku nodes:
admin@node-01.ac-cn-hongkong-c.wakuv2.prod:~ % grep metrics /docker/nim-waku-v2/docker-compose.yml
--metrics-server=True
--metrics-server-port=8008
--metrics-server-address=0.0.0.0
admin@node-01.ac-cn-hongkong-c.wakuv2.prod:~ % c 0:8008/metrics | grep routing_table_nodes
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes{state=""} 43.0
routing_table_nodes_created{state=""} 1689722335.0
routing_table_nodes{state="seen"} 43.0
routing_table_nodes_created{state="seen"} 1689722335.0
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes 0.0
routing_table_nodes_created 1689722327.0
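The dump above is exactly the invalid case: the routing_table_nodes family is declared twice (once with a state label, once bare), which is why the remote-write path later rejects it as a duplicate sample. A small stdlib-only check (illustrative, not part of any tooling mentioned in this thread) can spot such repeated family declarations in exposition text:

```python
# Scan Prometheus exposition text for metric families declared more than
# once, i.e. families with a repeated "# TYPE <name> <type>" line.
from collections import Counter

def duplicate_families(exposition: str):
    """Return metric family names whose # TYPE line appears more than once."""
    counts = Counter(
        line.split()[2]                 # "# TYPE <name> <type>" -> <name>
        for line in exposition.splitlines()
        if line.startswith("# TYPE ")
    )
    return [name for name, n in counts.items() if n > 1]

# Trimmed-down version of the dump from the issue.
sample = """\
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes{state="seen"} 43.0
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes 0.0
"""

print(duplicate_families(sample))  # -> ['routing_table_nodes']
```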
So I've dropped those too:
And opened an issue:
Is anyone going to fix it at any point? Hello?
My initial quick assessment at https://github.com/status-im/nimbus-eth1/issues/1528#issuecomment-1640214271 was not the actual cause of this (although that work should still be done).
The cause is some import/export pollution of the discv4 routing table code. Although this code is not actually used, it registers the same metric name (without the label). The quick fix for now is to rename that one, see https://github.com/status-im/nim-eth/pull/646
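A registry that enforced family-name uniqueness would have caught this collision at startup instead of emitting invalid exposition output. A toy sketch of that idea (hypothetical names, not the nim-metrics API):

```python
# Toy metrics registry that rejects duplicate family names, modelling how the
# discv4/discv5 collision could have been caught, and how the rename avoids it.

class Registry:
    def __init__(self):
        self.collectors = {}

    def register(self, name):
        if name in self.collectors:
            raise ValueError(f"duplicate metric family: {name}")
        self.collectors[name] = True

reg = Registry()
reg.register("routing_table_nodes")          # discv5 table, with labels
try:
    reg.register("routing_table_nodes")      # pulled-in discv4 table, no labels
except ValueError as e:
    print(e)                                 # collision detected, not silently exported
reg.register("discv4_routing_table_nodes")   # quickfix: renamed, no clash
```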
Thank you.
It appears the last instance of this error in Prometheus for the ih-eu-mda1 DC was today at midnight:
admin@slave-01.ih-eu-mda1.metrics.hq:~ % zgrep routing_table_nodes /var/log/docker/prometheus-slave/docker.* | tail -n1
/var/log/docker/prometheus-slave/docker.log:2023-11-02T00:05:45.836145+00:00 docker/prometheus-slave[840]: ts=2023-11-02T00:05:45.835Z caller=dedupe.go:112 component=remote level=error remote_name=cortex url=http://host-gateway:19092/api/v1/push msg="non-recoverable error" count=1000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: user=fake: err: duplicate sample for timestamp. timestamp=2023-11-02T00:05:45.245Z, series={__name__=\"routing_table_nodes\", container=\"nimbus-fluffy-mainnet-master-27\", datacenter=\"ih-eu-mda1\", fleet=\"nimbus.fluffy\", group=\",nimbus.fluffy,eth1,nimbus,metrics,\", instance=\"metal-02.ih-eu-mda1.nimbus.fluffy\", job=\"nimbus-fluffy-metrics\", source=\"slave-01.ih-eu-mda1.metrics.hq\"}"
And the graph does show this too:
Which also matches the build timer run on metal-01.ih-eu-mda1.nimbus.eth1:
admin@metal-01.ih-eu-mda1.nimbus.eth1:~ % j -n2 -u build-nimbus-eth1-goerli-master.service
Nov 02 00:02:53 metal-01.ih-eu-mda1.nimbus.eth1 systemd[1]: Finished Build nimbus-eth1-goerli-master.
Nov 02 00:02:53 metal-01.ih-eu-mda1.nimbus.eth1 systemd[1]: build-nimbus-eth1-goerli-master.service: Consumed 9min 24.656s CPU time.
So I consider this fixed. Thank you @kdeme .
I have been seeing some errors on the metrics backend that started on the 17th of March:
Those errors are duplicate sample for timestamp and are always triggered for the routing_table_nodes metric:
If we look at the metric itself it appears to be broken:
As we can see, it's listed twice: once with a state label, which has "" and "seen" values, but also as a metric without any labels. This looks like a bug that was introduced somewhere around the 17th of March.