jayunit100 opened this issue 1 year ago
Would it be better to raise a metric for this instead? Using logs to capture metrics seems like a strange approach.
... logs work in small clusters that don't have full-blown monitoring. In my case, that is likely the norm as opposed to the exception.
maybe we can capture this in a Prometheus metric and just print the metric out :)
I understand that small clusters might not have full-blown monitoring, but at the same time I don't think we should build into our controllers/API a system to count failures across different reconciles, because this comes with some immediate drawbacks.
Accordingly, I'm +1 to address this using metrics
+1 Sounds reasonable.
Apparently there is an integration between gRPC and Prometheus, and etcd has an example of how to combine them for the etcd client: https://github.com/etcd-io/etcd/blob/main/tests/integration/clientv3/examples/example_metrics_test.go
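For reference, the wiring in that linked example is small. A minimal sketch assuming go-grpc-prometheus (the endpoint and port below are placeholders):

```go
package main

import (
	"log"
	"net/http"
	"time"

	grpcprom "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc"
)

func main() {
	// Wire the gRPC client interceptors so every etcd RPC is counted
	// in the default Prometheus registry (grpc_client_* metric families).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
		DialOptions: []grpc.DialOption{
			grpc.WithUnaryInterceptor(grpcprom.UnaryClientInterceptor),
			grpc.WithStreamInterceptor(grpcprom.StreamClientInterceptor),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Expose the collected metrics; failed RPCs carry a grpc_code label,
	// so connection problems are countable per status code.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```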
it doesn't require persistent state really, it's just a heuristic. And really it's just a small hashtable at that :), and I don't think it needs to be perfect.
But "metrics" is fine to - ok if we just "print" the metrics histogram out periodically ? Then everyone wins :)
But "metrics" is fine to - ok if we just "print" the metrics histogram out periodically ? Then everyone wins :)
To be honest, printing metrics really sounds like an anti-pattern to me.
anti-pattern
Touché, but it has a low cost... like... maybe ten lines of code (see the sketch at the end of this comment)... no functional changes to the API...
Agreed, nothing beats a lavish Grafana dashboard, but in chaotic situations a "dumb" alternative is needed...
But fair enough: what's the workaround for admin people that doesn't require running extra tools to get a quick topology of etcd polling failures in large / edge-like scenarios? Anything come to mind?
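To make the "ten lines of code" claim concrete, here's roughly what that print could look like. This is a sketch only, assuming the metrics are already registered with the default Prometheus registry:

```go
package main

import (
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/common/expfmt"
)

// dumpMetrics prints every metric family in the given registry in the
// standard Prometheus text format, so histogram buckets end up in the
// pod logs without any external tooling.
func dumpMetrics(g prometheus.Gatherer) {
	families, err := g.Gather()
	if err != nil {
		return
	}
	for _, mf := range families {
		_, _ = expfmt.MetricFamilyToText(os.Stdout, mf)
	}
}

func main() {
	for range time.Tick(5 * time.Minute) {
		dumpMetrics(prometheus.DefaultGatherer)
	}
}
```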
@fabriziopandini Prometheus histograms... don't they already give us this for free in memory? I don't see the persistent state corollary. But I agree this shouldn't be an overblown CRD / API level thing. I didn't mean to suggest that... it's more like a point-in-time logging heuristic.
Running hundreds of clusters and not enabling Prometheus at the management layer?
They don't need Prometheus specifically, but running Kubernetes clusters at that scale in production without monitoring doesn't sound like a sane approach to me.
Prometheus histograms... don't they already give us this for free in memory?
Yup they should. I think Fabrizio meant if we build our own instead of just using metrics.
it requires some persistent state to ensure counters survive controller restarts
If solved with normal metrics, that's usually fine. For normal counters, rates are typically used (e.g. errors per minute vs an absolute error count), and Prometheus usually handles it well if a counter starts from 0 after a restart. Not exactly sure how it works with histograms, but I don't expect problems there if we implement normal metrics, as it's a very common case.
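To illustrate that point, a minimal sketch of such a counter. All metric names here are hypothetical, and registering via controller-runtime's registry is an assumption about where CAPI metrics would live:

```go
// Illustration only; metric and package names are hypothetical.
package clustercachemetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// etcdConnectionErrors counts failed etcd client connections per target
// cluster. Registering with the controller-runtime registry means it is
// served on the manager's existing /metrics endpoint.
var etcdConnectionErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "capi_etcd_connection_errors_total",
		Help: "Number of failed etcd client connections, per target cluster.",
	},
	[]string{"cluster"},
)

func init() {
	ctrlmetrics.Registry.MustRegister(etcdConnectionErrors)
}

// On a failed connection attempt:
//   etcdConnectionErrors.WithLabelValues(clusterName).Inc()
//
// A restart resets the counter to 0, but PromQL's
// rate(capi_etcd_connection_errors_total[5m]) detects counter resets,
// so the per-minute error rate stays correct without persistent state.
```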
that doesn't require running extra tools to get a quick topology of etcd polling failures in large / edge-like scenarios. Anything come to mind?
I think the first step should be to implement the regular metrics.
Then we can think of how users could consume them if they don't want to use Prometheus / a regular metrics tool. I could imagine folks could add a sidecar container which does a curl against the metrics endpoint regularly and prints the results. I'm not a big fan of it, but as that would be outside the scope of Cluster API itself... I'm fine with that.
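For what it's worth, such a sidecar could itself be tiny. A sketch assuming the manager serves metrics on controller-runtime's default :8080 (the port is configurable, so treat it as an assumption):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	// Poll the manager's metrics endpoint and dump it to stdout, so the
	// numbers land in `kubectl logs` of the sidecar.
	for range time.Tick(time.Minute) {
		resp, err := http.Get("http://localhost:8080/metrics") // port is an assumption
		if err != nil {
			log.Println(err)
			continue
		}
		_, _ = io.Copy(os.Stdout, resp.Body)
		resp.Body.Close()
	}
}
```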
The answer to any logging problem can be better metrics and alerting, but logs are just a batteries-included solution that works anywhere.
Why don't we split the difference:
first step should be to implement regular metrics
yup
Renamed, since there is agreement on the first step, which is implementing metrics that help investigate connection problems.
/triage accepted
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
/triage accepted (org members only)
/close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/priority important-longterm
/assign @sbueringer To figure out how we can implement custom metrics for the cluster cache tracker (probably health checking + connection creation)
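For the connection-creation half of that, one possible shape (again with hypothetical names) would be a histogram observed around the dial:

```go
// Illustration only; metric and package names are hypothetical.
package clustercachemetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// etcdConnectionDuration records how long establishing an etcd client
// connection takes, per target cluster.
var etcdConnectionDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "capi_etcd_connection_duration_seconds",
		Help:    "Time taken to establish an etcd client connection.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"cluster"},
)

func init() {
	ctrlmetrics.Registry.MustRegister(etcdConnectionDuration)
}

// Around the dial:
//   start := time.Now()
//   ... create the connection ...
//   etcdConnectionDuration.WithLabelValues(clusterName).Observe(time.Since(start).Seconds())
```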
User Story
As a developer running CAPI in far-separated networks, I'd like the kubeadm control plane manager to give me a delta between healthy nodes that's easy to read, i.e.
if I do a simple disconnect experiment on a workload cluster (with a control plane node running etcd on it),
I can see that the disconnect is logged easily... but I can't easily ascertain how many nodes are / aren't making that etcd connection (and yes, in general, I understand there are higher-level declarative constructs and that using logs for everything is an anti-pattern, but... in the real world, being able to see etcd client statistics, in real time, is much more useful to quickly hypothesize a failure mode).
So my suggestion would be, I think, some kind of ... in the logs.
Desired output
Current output
Detailed Description
[A clear and concise description of what you want to happen.]
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
/kind feature