status-im / nimbus-eth1

Nimbus: an Ethereum Execution Client for Resource-Restricted Devices
https://status-im.github.io/nimbus-eth1/
Apache License 2.0

The routing_table_nodes metric format is wrong #1528

Closed jakubgs closed 10 months ago

jakubgs commented 1 year ago

I have been seeing some errors on the metrics backend that started on the 17th of March:

(screenshot: duplicate-sample errors appearing on the metrics backend)

Those errors are "duplicate sample for timestamp" errors and are always triggered for the routing_table_nodes metric:

ts=2023-04-04T07:33:20.944Z caller=dedupe.go:112 component=remote level=error remote_name=cortex url=http://host-gateway:19092/api/v1/push msg="non-recoverable error" count=1000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: user=fake: err: duplicate sample for timestamp. timestamp=2023-04-04T07:33:20.605Z, series={__name__=\"routing_table_nodes\", container=\"nimbus-fluffy-mainnet-master-25\", datacenter=\"he-eu-hel1\", fleet=\"nimbus.fluffy\", group=\",nimbus.fluffy,eth1,nimbus,metrics,\", instance=\"metal-02.he-eu-hel1.nimbus.fluffy\", job=\"nimbus-fluffy-metrics\", source=\"slave-01.he-eu-hel1.metrics.hq\"}"

If we look at the metric itself it appears to be broken:

admin@metal-01.he-eu-hel1.nimbus.fluffy:~ % c 0:9210/metrics | grep routing_table_nodes      
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes{state=""} 181.0
routing_table_nodes_created{state=""} 1680593730.0
routing_table_nodes{state="seen"} 89.0
routing_table_nodes_created{state="seen"} 1680593730.0
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes 0.0
routing_table_nodes_created 1680593730.0

As we can see, the metric is exposed twice: once with a state label taking the values "" and "seen", and once without any labels. Since Prometheus treats an empty label value as equivalent to the label being absent, the unlabeled series and the state="" series collide, which is what produces the duplicate sample errors.

This looks like a bug that was introduced somewhere around the 17th of March.
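
As an aside, the malformed exposition can be checked mechanically. Assuming promtool from the Prometheus distribution is available on the host, its metrics linter (which reads the text format from stdin) should flag the repeated HELP/TYPE stanzas for the same metric name:

admin@metal-01.he-eu-hel1.nimbus.fluffy:~ % c 0:9210/metrics | promtool check metrics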

kdeme commented 1 year ago

This might have to do with the fact that we use the same routing_table code for discv5 and each different Portal sub-network.

Aside from the odd duplicate metric, the values themselves will also be incorrect, as a single gauge aggregates the data for all networks together.

We should find a way to split this routing_table_nodes metric per network / routing table instance.
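
A minimal sketch of what that could look like with nim-metrics, assuming an extra label is threaded through from each routing table instance (the "network" label name and values below are hypothetical):

import metrics

# One gauge shared by all routing tables, disambiguated per network by an
# extra label next to the existing "state" label (hypothetical layout).
declarePublicGauge routing_table_nodes,
  "Discovery routing table nodes", labels = ["state", "network"]

# Each routing table instance would then export its own series, e.g.:
routing_table_nodes.set(89, labelValues = ["seen", "portal-history"])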

jakubgs commented 1 year ago

Hah, I forgot about this. It would be nice to fix this, but for now I'll just drop this metric:
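
(For reference, the drop is a standard metric_relabel_configs rule on the scrape job; the sketch below uses the job name from the error above and also drops the accompanying _created series:)

  - job_name: nimbus-fluffy-metrics
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'routing_table_nodes(_created)?'
        action: drop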

jakubgs commented 1 year ago

Actually, it appears this issue also exists in Nim-Waku nodes:

admin@node-01.ac-cn-hongkong-c.wakuv2.prod:~ % grep metrics /docker/nim-waku-v2/docker-compose.yml    
      --metrics-server=True
      --metrics-server-port=8008
      --metrics-server-address=0.0.0.0
admin@node-01.ac-cn-hongkong-c.wakuv2.prod:~ % c 0:8008/metrics | grep routing_table_nodes         
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes{state=""} 43.0
routing_table_nodes_created{state=""} 1689722335.0
routing_table_nodes{state="seen"} 43.0
routing_table_nodes_created{state="seen"} 1689722335.0
# HELP routing_table_nodes Discovery routing table nodes
# TYPE routing_table_nodes gauge
routing_table_nodes 0.0
routing_table_nodes_created 1689722327.0

So I've dropped those too:

And opened an issue:

jakubgs commented 10 months ago

Is anyone going to fix it at any point? Hello?

kdeme commented 10 months ago

My initial quick assessment at https://github.com/status-im/nimbus-eth1/issues/1528#issuecomment-1640214271 was not the actual cause of this (although that work should still be done).

The cause is some import/export pollution from the discv4 routing table code. Although that code is not actually used, it declares a metric with the same name (without the label). The quick fix for now is to rename that one, see https://github.com/status-im/nim-eth/pull/646
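
For context, a minimal illustration of the clash (the comments stand in for the two separate modules; this is not the actual module layout):

# The legacy discv4 code declares the gauge without labels:
declarePublicGauge routing_table_nodes, "Discovery routing table nodes"

# The discv5/Portal routing table declares the same name with a state label:
declarePublicGauge routing_table_nodes,
  "Discovery routing table nodes", labels = ["state"]

# Once the discv4 module is pulled in via import/export, both collectors sit
# in the default registry, so /metrics serializes the same metric name twice,
# producing exactly the duplicated output shown above. Renaming the unused
# discv4 gauge removes the collision.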

kdeme commented 10 months ago

Fix in https://github.com/status-im/nimbus-eth1/pull/1874

jakubgs commented 10 months ago

Thank you.

jakubgs commented 10 months ago

It appears the last instance of this error in Prometheus for the ih-eu-mda1 DC was today at midnight:

admin@slave-01.ih-eu-mda1.metrics.hq:~ % zgrep routing_table_nodes /var/log/docker/prometheus-slave/docker.* | tail -n1
/var/log/docker/prometheus-slave/docker.log:2023-11-02T00:05:45.836145+00:00 docker/prometheus-slave[840]: ts=2023-11-02T00:05:45.835Z caller=dedupe.go:112 component=remote level=error remote_name=cortex url=http://host-gateway:19092/api/v1/push msg="non-recoverable error" count=1000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: user=fake: err: duplicate sample for timestamp. timestamp=2023-11-02T00:05:45.245Z, series={__name__=\"routing_table_nodes\", container=\"nimbus-fluffy-mainnet-master-27\", datacenter=\"ih-eu-mda1\", fleet=\"nimbus.fluffy\", group=\",nimbus.fluffy,eth1,nimbus,metrics,\", instance=\"metal-02.ih-eu-mda1.nimbus.fluffy\", job=\"nimbus-fluffy-metrics\", source=\"slave-01.ih-eu-mda1.metrics.hq\"}"

And the graph does show this too:

(graph: the duplicate-sample errors stop after midnight on 2 November)

Which also matches the build timer run on metal-01.ih-eu-mda1.nimbus.eth1:

admin@metal-01.ih-eu-mda1.nimbus.eth1:~ % j -n2 -u build-nimbus-eth1-goerli-master.service 
Nov 02 00:02:53 metal-01.ih-eu-mda1.nimbus.eth1 systemd[1]: Finished Build nimbus-eth1-goerli-master.
Nov 02 00:02:53 metal-01.ih-eu-mda1.nimbus.eth1 systemd[1]: build-nimbus-eth1-goerli-master.service: Consumed 9min 24.656s CPU time.

So I consider this fixed. Thank you, @kdeme.