skoef / birdwatcher

healthchecker for BIRD-anycasted services
prometheus exporter #17

Closed bad3bs closed 7 months ago

bad3bs commented 10 months ago

It would be great to add a Prometheus exporter, like the one in anycast-healthchecker.

skoef commented 10 months ago

That would be a nice feature indeed! Which metrics would you like to see? I'm not familiar with the metrics from anycast-healthchecker

skoef commented 10 months ago

I've taken a look at anycast-healthchecker and came up with the following metrics:

# HELP birdwatcher_service_fail_total Number of failed probes per service
# TYPE birdwatcher_service_fail_total counter
birdwatcher_service_fail_total{service="foo"} 20
# HELP birdwatcher_service_health Current health state per service
# TYPE birdwatcher_service_health gauge
birdwatcher_service_health{service="foo"} 0
# HELP birdwatcher_service_success_total Number of successful probes per service
# TYPE birdwatcher_service_success_total counter
birdwatcher_service_success_total{service="foo"} 6
# HELP birdwatcher_service_transition_total Number of transitions per service
# TYPE birdwatcher_service_transition_total counter
birdwatcher_service_transition_total{service="foo"} 3

Which means the service foo has failed 20 times in total, is currently down, had 6 successes and transitioned 3 times from down to up or the other way around.
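As a rough illustration of how these four series relate to each other, the sketch below replays a list of probe results and derives the counters and the gauge from it. This is not birdwatcher's implementation: `ServiceStats` and `observe` are hypothetical names, rise/fall thresholds are ignored (a single probe flips the state), and the service is assumed to start out healthy.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    """Hypothetical per-service bookkeeping for the four proposed metrics."""
    name: str
    success_total: int = 0
    fail_total: int = 0
    transition_total: int = 0
    healthy: bool = True  # assumption: a service starts out as up

    def observe(self, ok: bool) -> None:
        # Count the probe result itself.
        if ok:
            self.success_total += 1
        else:
            self.fail_total += 1
        # Assumption: no rise/fall thresholds, one probe is enough to flip.
        if ok != self.healthy:
            self.healthy = ok
            self.transition_total += 1

    def render(self) -> str:
        label = f'{{service="{self.name}"}}'
        return "\n".join([
            f"birdwatcher_service_fail_total{label} {self.fail_total}",
            f"birdwatcher_service_health{label} {int(self.healthy)}",
            f"birdwatcher_service_success_total{label} {self.success_total}",
            f"birdwatcher_service_transition_total{label} {self.transition_total}",
        ])


stats = ServiceStats("foo")
# 10 failures (up -> down), 6 successes (down -> up), 10 failures (up -> down).
for ok in [False] * 10 + [True] * 6 + [False] * 10:
    stats.observe(ok)
print(stats.render())
```

With that probe sequence the rendered samples match the example values: 20 failures, 6 successes, 3 transitions, and a health gauge of 0.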

bad3bs commented 10 months ago

yeah, but maybe more detail:

# HELP birdwatcher_service The configured service checks
# TYPE birdwatcher_service gauge
birdwatcher_service{ip_prefix="10.10.10.10/32",service_name="foo"} 1.0
birdwatcher_service{ip_prefix="20.20.20.20/32",service_name="bar"} 1.0
# HELP birdwatcher_service_state The status of the service check: 1 = healthy, 0 = unhealthy
# TYPE birdwatcher_service_state gauge
birdwatcher_service_state{ip_prefix="10.10.10.10/32",service_name="foo"} 1.0
birdwatcher_service_state{ip_prefix="20.20.20.20/32",service_name="bar"} 0.0
# HELP birdwatcher_service_check_ip_assignment Service IP assignment check: 0 = not assigned, 1 = assigned
# TYPE birdwatcher_service_check_ip_assignment gauge
birdwatcher_service_check_ip_assignment{ip_prefix="10.10.10.10/32",service_name="foo"} 1.0
birdwatcher_service_check_ip_assignment{ip_prefix="20.20.20.20/32",service_name="bar"} 0.0
# HELP birdwatcher_service_check_timeout_total The number of times a service check timed out
# TYPE birdwatcher_service_check_timeout_total counter
birdwatcher_service_check_timeout_total{ip_prefix="10.10.10.10/32",service_name="foo"} 2.0
birdwatcher_service_check_timeout_total{ip_prefix="20.20.20.20/32",service_name="bar"} 10.0

bad3bs commented 10 months ago

and

# HELP birdwatcher_service_check_duration_milliseconds Service check duration in milliseconds
# TYPE birdwatcher_service_check_duration_milliseconds gauge
birdwatcher_service_check_duration_milliseconds{ip_prefix="10.10.10.10/32",service_name="foo"} 5.141496658325195

skoef commented 10 months ago

The initial reason I started building birdwatcher was that I wanted many prefixes per service, which didn't scale too well in anycast-healthchecker. So I'd prefer not to add the prefixes as labels to the metrics, since there could easily be too many of them. Or do you have a specific use case for the prefixes in the metrics?

Apart from that, here are my thoughts on your proposed metrics:

Best of wishes for 2024!

bad3bs commented 10 months ago

Happy new year! :)

skoef commented 10 months ago
  • The prefix as a label is needed when filtering metrics, or when drawing graphs and writing annotations, etc.

Instead of monitoring (or graphing) the service state for a given prefix, why not monitor (or graph) the service it is assigned to? Anycast-healthchecker has 1 prefix per service, so it makes sense to add that as a label as well. With birdwatcher you'll end up with many metrics, one for every prefix/service combination.
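To put rough numbers on the cardinality concern, here is a small sketch that counts how many time series result with and without a prefix label. The service names, prefix counts, and the `series_count` helper are all made up for illustration.

```python
from typing import Mapping


def series_count(prefixes_per_service: Mapping[str, int],
                 metrics: int,
                 prefix_label: bool) -> int:
    """Count time series for `metrics` metric names across all services.

    With a prefix label, every metric is exported once per
    prefix/service combination; without it, once per service.
    """
    total = 0
    for n_prefixes in prefixes_per_service.values():
        total += metrics * (n_prefixes if prefix_label else 1)
    return total


# Hypothetical setup: two services with many prefixes each.
services = {"foo": 200, "bar": 50}

per_service = series_count(services, metrics=4, prefix_label=False)
per_prefix = series_count(services, metrics=4, prefix_label=True)
print(per_service, per_prefix)
```

In this made-up setup, four metrics cost 8 series when labelled by service only and 1000 series when labelled by prefix as well.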

  • A timeout metric is needed, like in canary checker, to understand what happens with the checked service.

Well, I guess we could have a timeout counter as well, but since a timeout also counts as a failed service check (and would therefore increase the failure counter for that service), what value would the timeout counter add on top of the failure counter?
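The relation between the two counters can be sketched as follows: a timed-out check increments both the failure counter and a separate timeout counter, so the timeout count is always a subset of the failures. This is only an illustration of that bookkeeping. `run_check` and `record` are hypothetical helpers, and `sys.executable` stands in for a real check command.

```python
import subprocess
import sys
from collections import Counter


def run_check(command, timeout):
    """Run a check command and return "success", "fail" or "timeout"."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return "timeout"
    return "success" if result.returncode == 0 else "fail"


def record(counters, outcome):
    """Update the counters; a timeout also counts as a failure."""
    if outcome == "success":
        counters["success_total"] += 1
        return
    counters["fail_total"] += 1
    if outcome == "timeout":
        counters["timeout_total"] += 1


counters = Counter()
python = sys.executable  # stand-in for a real check script
record(counters, run_check([python, "-c", "raise SystemExit(0)"], timeout=10))
record(counters, run_check([python, "-c", "raise SystemExit(1)"], timeout=10))
record(counters, run_check([python, "-c", "import time; time.sleep(30)"], timeout=0.5))
print(dict(counters))
```

Subtracting the timeout counter from the failure counter then gives the number of checks that failed by exit code rather than by running too long.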

  • The "IP assignment check" shows that the config was added and BIRD is configured, not just whether checks time out.

So, you want to have a metric reflecting the configuration of the service? I can live with a metric like

# HELP birdwatcher_service_info Services and their configuration
# TYPE birdwatcher_service_info gauge
birdwatcher_service_info{name="foo",rise="1",fall="3",prefixes="6",interval="1",timeout="10"} 1

kind of metric.
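For an info-style metric like this, the configuration lives entirely in the labels and the sample value is a constant 1. In the Prometheus text exposition format, label values are quoted strings, with backslashes, double quotes, and newlines escaped. A minimal sketch of rendering such a sample, using hypothetical helper names and sorting the labels alphabetically for stable output:

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return (
        value.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name, labels, value):
    """Render one sample line; numeric label values are stringified."""
    pairs = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


line = format_sample(
    "birdwatcher_service_info",
    {"name": "foo", "rise": 1, "fall": 3, "prefixes": 6, "interval": 1, "timeout": 10},
    1,
)
print(line)
```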

bad3bs commented 10 months ago

Instead of monitoring (or graphing) the service state for a given prefix, why not monitor (or graph) the service it is assigned to? Anycast-healthchecker has 1 prefix per service, so it makes sense to add that as a label as well. With birdwatcher you'll end up with many metrics, one for every prefix/service combination.

Yeah, but with the prefix in a label, not just the service name, it is more comfortable to write queries in some scenarios.

Well, I guess we could have a timeout counter as well, but since a timeout also counts as a failed service check (and would therefore increase the failure counter for that service), what value would the timeout counter add on top of the failure counter?

It gives more information and makes it easier to analyze what is happening.

So, you want to have a metric reflecting the configuration of the service? I can live with a metric like

# HELP birdwatcher_service_info Services and their configuration
# TYPE birdwatcher_service_info gauge
birdwatcher_service_info{name="foo",rise="1",fall="3",prefixes="6",interval="1",timeout="10"} 1

kind of metric.

Looks good :)

skoef commented 10 months ago

@bad3bs this is what I came up with. I think it is a fair compromise:

# HELP birdwatcher_prefix_state Current health state per prefix
# TYPE birdwatcher_prefix_state gauge
birdwatcher_prefix_state{prefix="192.168.0.0/24",service="foo"} 0
birdwatcher_prefix_state{prefix="192.168.1.0/24",service="foo"} 0
birdwatcher_prefix_state{prefix="192.168.2.0/24",service="bar"} 1
# HELP birdwatcher_service_check_duration Service check duration in milliseconds
# TYPE birdwatcher_service_check_duration gauge
birdwatcher_service_check_duration{service="bar"} 9.139375e+06
birdwatcher_service_check_duration{service="foo"} 1.2018245208e+10
# HELP birdwatcher_service_fail_total Number of failed probes per service
# TYPE birdwatcher_service_fail_total counter
birdwatcher_service_fail_total{service="foo"} 1
# HELP birdwatcher_service_info Services and their configuration
# TYPE birdwatcher_service_info gauge
birdwatcher_service_info{command="check.sh",fail="1",function_name="match_route",interval="1",rise="1",service="foo",timeout="10s"} 1
birdwatcher_service_info{command="/usr/bin/true",fail="1",function_name="match_route",interval="1",rise="1",service="bar",timeout="10s"} 1
# HELP birdwatcher_service_state Current health state per service
# TYPE birdwatcher_service_state gauge
birdwatcher_service_state{service="bar"} 1
birdwatcher_service_state{service="foo"} 0
# HELP birdwatcher_service_success_total Number of successful probes per service
# TYPE birdwatcher_service_success_total counter
birdwatcher_service_success_total{service="bar"} 30
birdwatcher_service_success_total{service="foo"} 10
# HELP birdwatcher_service_timeout_total Number of timed out probes per service
# TYPE birdwatcher_service_timeout_total counter
birdwatcher_service_timeout_total{service="foo"} 1
# HELP birdwatcher_service_transition_total Number of transitions per service
# TYPE birdwatcher_service_transition_total counter
birdwatcher_service_transition_total{service="bar"} 1
birdwatcher_service_transition_total{service="foo"} 2
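To show how this output could be consumed, here is a small sketch that parses a few of the sample lines above and derives a per-service success ratio plus the list of prefixes that are currently down. The parser is deliberately simplified (no timestamps, no full escape handling) and the function names are hypothetical. Note that a counter with no sample yet, such as the failure counter for `bar`, is treated as 0.

```python
import re
from collections import defaultdict

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

TEXT = '''
# TYPE birdwatcher_prefix_state gauge
birdwatcher_prefix_state{prefix="192.168.0.0/24",service="foo"} 0
birdwatcher_prefix_state{prefix="192.168.1.0/24",service="foo"} 0
birdwatcher_prefix_state{prefix="192.168.2.0/24",service="bar"} 1
birdwatcher_service_fail_total{service="foo"} 1
birdwatcher_service_success_total{service="bar"} 30
birdwatcher_service_success_total{service="foo"} 10
'''


def parse(text):
    """Return (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def success_ratios(samples):
    """successes / (successes + failures) per service; missing counters are 0."""
    ok, failed = defaultdict(float), defaultdict(float)
    for name, labels, value in samples:
        if name == "birdwatcher_service_success_total":
            ok[labels["service"]] += value
        elif name == "birdwatcher_service_fail_total":
            failed[labels["service"]] += value
    return {
        svc: ok[svc] / (ok[svc] + failed[svc])
        for svc in set(ok) | set(failed)
        if ok[svc] + failed[svc] > 0
    }


samples = parse(TEXT)
ratios = success_ratios(samples)
down = [l["prefix"] for n, l, v in samples if n == "birdwatcher_prefix_state" and v == 0]
print(ratios, down)
```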

skoef commented 10 months ago

@bad3bs you can use the newly tagged version: 1.0.0-beta3. Just take note of the upgrade instructions!