prometheus-operator / kube-prometheus

Use Prometheus to monitor Kubernetes and applications running on Kubernetes
https://prometheus-operator.dev/
Apache License 2.0

Consider integrating goldpinger into kube-prometheus #646

Open lilic opened 4 years ago

lilic commented 4 years ago

goldpinger is a debugging tool for Kubernetes that tests and displays connectivity between nodes in the cluster. It also provides metrics out of the box, which is why it would be nice to integrate it into kube-prometheus.

It already has a Grafana dashboard. They also seem to be open to contributions around their alerts. <3

I tried it out on a multi-node local Kubernetes cluster; all the metrics I got are at the bottom of this issue. I found goldpinger_nodes_health_total the most useful, along with the goldpinger_peers_response_time_s histogram, which is a "Histogram of response times from other hosts, when making peer calls". The latter might be interesting for any SLOs we might want to define around nodes.
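As a rough illustration of the SLO idea, here is a minimal Python sketch that computes the share of peer pings answered within a latency threshold from cumulative histogram buckets. The bucket values are copied from the dump below (call_type="ping", host_ip="172.17.0.3"); the function name and the 5 ms threshold are assumptions made for the example, not anything goldpinger or kube-prometheus defines.

```python
# Cumulative bucket counts taken from the metrics dump in this issue
# (call_type="ping", host_ip="172.17.0.3"). Keys are the `le` upper
# bounds in seconds; only a few of the 13 buckets are listed here.
ping_buckets = {0.005: 57, 0.01: 58, 0.025: 58, float("inf"): 58}


def fraction_within(buckets, threshold_s):
    """Share of observations at or below threshold_s.

    threshold_s must be one of the bucket bounds, because a Prometheus
    histogram only stores counts at those bounds. Returns None when no
    observations have been recorded yet.
    """
    total = buckets[float("inf")]
    if total == 0:
        return None
    return buckets[threshold_s] / total


# Hypothetical SLO threshold of 5 ms, chosen only for illustration.
ratio = fraction_within(ping_buckets, 0.005)
print(f"{ratio:.4f} of pings answered within 5 ms")
```

In a real setup the same ratio would more likely be computed in PromQL over a time window, by dividing the rate of the relevant `_bucket` series by the rate of the `_count` series, rather than from a single scrape.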

Metrics dump:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.8399e-05
go_gc_duration_seconds{quantile="0.25"} 4.0884e-05
go_gc_duration_seconds{quantile="0.5"} 0.000160089
go_gc_duration_seconds{quantile="0.75"} 0.000325157
go_gc_duration_seconds{quantile="1"} 0.039619188
go_gc_duration_seconds_sum 0.043393583
go_gc_duration_seconds_count 18
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 20
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.14.1"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 6.03264e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.0318832e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.454872e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 240592
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 4.179914147615419e-05
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.582216e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 6.03264e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 5.7810944e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 8.577024e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 31191
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 5.742592e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.6387968e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.597312501650434e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 271783
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 3472
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 112200
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 131072
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.058872e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 664800
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 720896
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 720896
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.2958208e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 9
# HELP goldpinger_kube_master_response_time_s Histogram of response times from kubernetes API server, when listing other instances
# TYPE goldpinger_kube_master_response_time_s histogram
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="0.005"} 5
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="0.01"} 41
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="0.025"} 58
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="0.05"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="0.1"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="0.25"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="0.5"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="1"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="2.5"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="5"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="10"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="30"} 61
goldpinger_kube_master_response_time_s_bucket{goldpinger_instance="kind-worker2",le="+Inf"} 61
goldpinger_kube_master_response_time_s_sum{goldpinger_instance="kind-worker2"} 0.6124340499999998
goldpinger_kube_master_response_time_s_count{goldpinger_instance="kind-worker2"} 61
# HELP goldpinger_nodes_health_total Number of nodes seen as healthy/unhealthy from this instance's POV
# TYPE goldpinger_nodes_health_total gauge
goldpinger_nodes_health_total{goldpinger_instance="kind-worker2",status="healthy"} 2
goldpinger_nodes_health_total{goldpinger_instance="kind-worker2",status="unhealthy"} 0
# HELP goldpinger_peers_response_time_s Histogram of response times from other hosts, when making peer calls
# TYPE goldpinger_peers_response_time_s histogram
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.005"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.01"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.025"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.05"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.1"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.25"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.5"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="1"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="2.5"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="5"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="10"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="30"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="+Inf"} 1
goldpinger_peers_response_time_s_sum{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2"} 0.002451531
goldpinger_peers_response_time_s_count{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.005"} 0
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.01"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.025"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.05"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.1"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.25"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.5"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="1"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="2.5"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="5"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="10"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="30"} 1
goldpinger_peers_response_time_s_bucket{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="+Inf"} 1
goldpinger_peers_response_time_s_sum{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2"} 0.005027892
goldpinger_peers_response_time_s_count{call_type="check",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2"} 1
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.005"} 57
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.01"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.025"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.05"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.1"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.25"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="0.5"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="1"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="2.5"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="5"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="10"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="30"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2",le="+Inf"} 58
goldpinger_peers_response_time_s_sum{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2"} 0.087007868
goldpinger_peers_response_time_s_count{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.3",pod_ip="10.244.2.2"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.005"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.01"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.025"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.05"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.1"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.25"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="0.5"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="1"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="2.5"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="5"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="10"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="30"} 58
goldpinger_peers_response_time_s_bucket{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2",le="+Inf"} 58
goldpinger_peers_response_time_s_sum{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2"} 0.13343217099999996
goldpinger_peers_response_time_s_count{call_type="ping",goldpinger_instance="kind-worker2",host_ip="172.17.0.4",pod_ip="10.244.1.2"} 58
# HELP goldpinger_stats_total Statistics of calls made in goldpinger instances
# TYPE goldpinger_stats_total counter
goldpinger_stats_total{action="check",goldpinger_instance="kind-worker2",group="made"} 2
goldpinger_stats_total{action="check",goldpinger_instance="kind-worker2",group="received"} 1
goldpinger_stats_total{action="check_all",goldpinger_instance="kind-worker2",group="received"} 1
goldpinger_stats_total{action="healthz",goldpinger_instance="kind-worker2",group="received"} 700
goldpinger_stats_total{action="ping",goldpinger_instance="kind-worker2",group="made"} 116
goldpinger_stats_total{action="ping",goldpinger_instance="kind-worker2",group="received"} 116
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1.61
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 13
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.7807744e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.59731082005e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.47614208e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes -1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 5
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

brancz commented 4 years ago

Super curious about some details, but I think adding this is an amazing idea. It's baseline monitoring that everyone should have.

lilic commented 4 years ago

@brancz what kind of details? Maybe I can answer or explore some of them?

brancz commented 4 years ago

I'm a little curious about the cardinality of this. It looks like this would end up being O(n^2) series, as each host reports on each host. That could get expensive quickly, let's say with 10k nodes.
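To put rough numbers on that, here is a back-of-the-envelope Python sketch for the goldpinger_peers_response_time_s histogram alone. It assumes one goldpinger instance per node, the 13 buckets and 2 call types ("check" and "ping") visible in the dump above, and that an instance does not report on itself; the function name and the `include_self` switch are made up for this estimate, and real numbers may differ.

```python
def peer_histogram_series(nodes, buckets=13, call_types=2, include_self=False):
    """Estimate the total number of goldpinger_peers_response_time_s
    series across a cluster, assuming one instance per node."""
    # Number of peers each instance reports on (assumption: all other nodes).
    peers = nodes if include_self else nodes - 1
    # Each (instance, peer, call_type) combination exposes every bucket
    # plus one _sum and one _count series.
    per_pair = call_types * (buckets + 2)
    return nodes * peers * per_pair


for n in (3, 100, 10_000):
    print(f"{n:>6} nodes -> {peer_histogram_series(n):,} series")
```

For a single instance with two peers this gives 2 * 30 = 60 series, which matches the 60 goldpinger_peers_response_time_s lines in the dump. Under the same assumptions, 10k nodes would come to roughly 3 billion series for this one histogram.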

j4ckstraw commented 2 years ago

Can goldpinger split nodes into zones?