kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Add metrics for realtime investigation of disconnected wl scenarios #7147


jayunit100 commented 1 year ago

User Story

As a developer running CAPI across far-separated networks, I'd like the kubeadm control plane manager to give me an easy-to-read delta between healthy and unhealthy nodes, i.e.

if I run a simple disconnect experiment on a workload (WL) cluster (against a control plane node with etcd on it),

sudo iptables -A INPUT -i eth0 -s 10.180.83.152 -j DROP

I can see that the disconnect gets logged, but I can't easily ascertain how many nodes are or aren't making that etcd connection (and yes, in general I understand there are higher-level declarative constructs and that using logs for everything is an antipattern, but in the real world, being able to see etcd client statistics in realtime is much more useful for quickly hypothesizing a failure mode).

So my suggestion would be something like:

failures := 0
stats := map[string]bool{}
for _, member := range workloadClusterEtcdMembers { // hypothetical list of the WL cluster's etcd members
   passed := checkEtcdMember(member) // hypothetical health check, e.g. an etcd client round trip
   stats[member.IP] = passed
   if !passed {
      failures++
   }
}
log.Infof("etcd polling results: %v, failures: %v", stats, failures)

in the logs ...

Desired output

I0901 14:00:29.382055       1 round_trippers.go:553] GET https://10.180.81.244:6443/api?timeout=10s  in 10000 milliseconds
... 
E0901 14:00:29.384611       1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="[cannot get remote client... 
... 
I0911 14:00:29.382055       1 controller.go:553] etcd polling results 
[1.2.3.4: true, 
 1.2.3.5: false, 
 1.2.3.6: true
], 
 failures: 1 

Current output

I0901 14:00:29.382055       1 round_trippers.go:553] GET https://10.180.81.244:6443/api?timeout=10s  in 10000 milliseconds
E0901 14:00:29.382397       1 controller.go:188] controller/kubeadmcontrolplane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" "cluster"="tkg-vc-antrea" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
E0901 14:00:29.384611       1 controller.go:317] controller/kubeadmcontrolplane "msg"="Reconciler error" "error"="[cannot get remote client to workload cluster: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), failed to create remote cluster client: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)]" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"
I0901 14:00:29.385855       1 round_trippers.go:553] GET https://10.180.80.169:6443/?timeout=5s 200 OK in 3 milliseconds
I0901 14:00:29.388442       1 controller.go:251] controller/kubeadmcontrolplane "msg"="Reconcile KubeadmControlPlane" "cluster"="tkg-vc-antrea" "name"="tkg-vc-antrea-control-plane" "namespace"="default" "reconciler group"="controlplane.cluster.x-k8s.io" "reconciler kind"="KubeadmControlPlane"


/kind feature

killianmuldoon commented 1 year ago

Would it be better to raise a metric for this instead? Using logs to capture metrics seems like a strange approach.

jayunit100 commented 1 year ago

... logs work in small clusters that don't have full-blown monitoring. In my case, that is likely the norm rather than the exception.

Maybe we can capture it in a Prometheus metric and just print the metric out :)

fabriziopandini commented 1 year ago

I understand that small clusters may not have full-blown monitoring, but at the same time I don't think we should build into our controllers/API a system to count failures across different reconciles, because this comes with some immediate drawbacks.

Accordingly, I'm +1 to addressing this using metrics.
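
As a rough sketch of what that could look like (metric name, labels, and the recordEtcdHealth hook are illustrative assumptions, not a settled design), a per-member gauge registered with controller-runtime's metrics registry would already expose the healthy/unhealthy delta on the manager's existing /metrics endpoint:

package kcpmetrics // illustrative placement, not an existing package

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// etcdMemberHealthy is 1 when the last poll of an etcd member succeeded, 0 otherwise.
// Name and labels are illustrative only.
var etcdMemberHealthy = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "capi_kcp_etcd_member_healthy",
        Help: "Whether the last etcd poll of a control plane member succeeded (1) or failed (0).",
    },
    []string{"cluster", "member"},
)

func init() {
    // controller-runtime's registry backs the manager's /metrics endpoint.
    metrics.Registry.MustRegister(etcdMemberHealthy)
}

// recordEtcdHealth would be called from wherever KCP already polls etcd members.
func recordEtcdHealth(cluster, member string, healthy bool) {
    v := 0.0
    if healthy {
        v = 1.0
    }
    etcdMemberHealthy.WithLabelValues(cluster, member).Set(v)
}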

sbueringer commented 1 year ago

+1 Sounds reasonable.

Apparently there is an integration between gRPC and Prometheus, and etcd has an example of how to combine them for the etcd client: https://github.com/etcd-io/etcd/blob/main/tests/integration/clientv3/examples/example_metrics_test.go
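
For reference, the wiring amounts to passing the go-grpc-prometheus client interceptors as dial options when the etcd client is created; a condensed sketch, adapted rather than copied from the linked test, with endpoints and metrics exposition left out:

import (
    grpcprom "github.com/grpc-ecosystem/go-grpc-prometheus"
    clientv3 "go.etcd.io/etcd/client/v3"
    "google.golang.org/grpc"
)

// newInstrumentedEtcdClient creates an etcd client whose gRPC calls are
// counted and timed by the go-grpc-prometheus client metrics.
func newInstrumentedEtcdClient(endpoints []string) (*clientv3.Client, error) {
    return clientv3.New(clientv3.Config{
        Endpoints: endpoints,
        DialOptions: []grpc.DialOption{
            grpc.WithUnaryInterceptor(grpcprom.UnaryClientInterceptor),
            grpc.WithStreamInterceptor(grpcprom.StreamClientInterceptor),
        },
    })
}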

jayunit100 commented 1 year ago

It doesn't really require persistent state, it's just a heuristic. And really it's just a small hashtable at that :), and I don't think it needs to be perfect.

But "metrics" is fine too - OK if we just "print" the metrics histogram out periodically? Then everyone wins :)

sbueringer commented 1 year ago

But "metrics" is fine too - OK if we just "print" the metrics histogram out periodically? Then everyone wins :)

To be honest, printing metrics really sounds like an anti-pattern to me.

jayunit100 commented 1 year ago

antipattern

Touché, but it has a low cost... like maybe ten lines of code... no functional changes to the API...

Agreed, nothing beats a lavish Grafana dashboard, but in chaotic situations a "dumb" alternative is needed...

But fair enough: what's the workaround for admins that doesn't require running extra tools to get a quick topology of etcd polling failures in large / edge-like scenarios? Open to other creative ideas here. Anything come to mind?

jayunit100 commented 1 year ago

@fabriziopandini Prometheus histograms... don't they already give us this for free, in memory? I don't see the persistent-state corollary. But I agree this shouldn't be an overblown CRD / API-level thing. I didn't mean to suggest that... it's more like a point-in-time logging heuristic.

sbueringer commented 1 year ago

running hundreds of clusters not enabling prometheus at the mgmt layer?

They don't need Prometheus specifically, but running Kubernetes clusters at that scale in production without monitoring doesn't sound like a sane approach to me.

Prometheus histograms... don't they already give us this for free, in memory?

Yup they should. I think Fabrizio meant if we build our own instead of just using metrics.

it requires some persistent state to ensure counters survive controller restarts

If solved with normal metrics, that's usually fine. For normal counters, rates are usually used (e.g. errors per minute vs. an absolute error count), and Prometheus usually handles a counter restarting from 0 well. Not exactly sure how it works with histograms, but I don't expect problems there if we implement normal metrics, as it's a very common case.

what's the workaround for admins that doesn't require running extra tools to get a quick topology of etcd polling failures in large / edge-like scenarios? Anything come to mind?

I think the first step should be to implement the regular metrics.

Then we can think about how users could consume them if they don't want to use Prometheus / a regular metrics tool. I could imagine folks adding a sidecar container which curls the metrics endpoint regularly and prints the results. I'm not a big fan of it, but as that would be outside the scope of Cluster API itself... I'm fine with that.
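
Purely as an illustration of that sidecar idea (outside Cluster API's scope, and assuming a hypothetical METRICS_URL pointing at the manager's metrics endpoint plus the illustrative capi_kcp_etcd_member_healthy series sketched above), a tiny Go poller could stand in for the curl loop:

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "os"
    "strings"
    "time"
)

func main() {
    // METRICS_URL is an assumption, e.g. http://localhost:8080/metrics depending on manager flags.
    url := os.Getenv("METRICS_URL")
    for {
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println("metrics fetch failed:", err)
        } else {
            scanner := bufio.NewScanner(resp.Body)
            for scanner.Scan() {
                line := scanner.Text()
                // Only print the (illustrative) etcd health series.
                if strings.HasPrefix(line, "capi_kcp_etcd_member_healthy") {
                    fmt.Println(line)
                }
            }
            resp.Body.Close()
        }
        time.Sleep(30 * time.Second)
    }
}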

jayunit100 commented 1 year ago

The answer to any logging problem can be better metrics and alerting, but logs are just a batteries-included solution that works anywhere.

Why don't we split the difference:

first step should be to implement regular metrics

yup

fabriziopandini commented 1 year ago

Renamed since there is agreement on the first step, which is implementing metrics that help to investigate connection problems.

/triage accepted

k8s-triage-robot commented 6 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

fabriziopandini commented 3 months ago

/priority important-longterm

fabriziopandini commented 3 months ago

/assign @sbueringer

To figure out how we can implement custom metrics for the cluster cache tracker (probably health checking + connection creation).
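
A possible shape for those metrics, as a hedged sketch only (names, labels, and the exact hook points in the health-checking and connection-creation code paths are assumptions, not a settled design):

package clustercachemetrics // illustrative placement, not an existing package

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Illustrative names; real metrics would need to follow Cluster API's naming conventions.
var (
    clusterCacheHealthCheckErrors = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "capi_clustercache_healthcheck_errors_total",
            Help: "Failed workload cluster health check probes, per cluster.",
        },
        []string{"namespace", "cluster"},
    )
    clusterCacheConnectionCreations = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "capi_clustercache_connection_creations_total",
            Help: "Times a client connection to a workload cluster was (re)created, per cluster.",
        },
        []string{"namespace", "cluster"},
    )
)

func init() {
    metrics.Registry.MustRegister(clusterCacheHealthCheckErrors, clusterCacheConnectionCreations)
}

With counters like these, a disconnected workload cluster shows up as a rising error rate on that cluster's series (e.g. via rate() in Prometheus), which is essentially the per-cluster delta the user story asks for.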