shyamjvs opened this issue 1 year ago
@shyamjvs: This request has been marked as needing help from a contributor.
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help
command.
cc @wojtek-t @lavalamp
Also please correct me if I misread any of the code.
/assign
I think I can take on the Authorization webhook section.
/assign
I want to work on the CRD Conversion webhook section.
/assign
I will work on the Extension apiserver section.
/assign
I'm willing to work on the Etcd request/error counts section.
@shyamjvs what is the gap you are referring to with the extension apiserver? As far as I am aware, aggregated apiservers in general should implement the same generic interface as the kube-apiserver and, as such, should expose the same request count/latency metrics as the kube-apiserver.
@dgrisonnet IIUC the main apiserver isn't exposing a metric showing latency/error counts for calls made to the aggregated apiservers though (like we do today for webhooks). As a result, monitoring those dependencies centrally via the /metrics API isn't possible today.
I am a bit on the fence with that one. I can see your point, but as far as I am aware the situation is a bit different. Aggregated apiservers should be written based on the generic apiserver library, which already provides request counters and latency metrics. This information should already be exposed today, so having another set of metrics in the kube-apiserver feels redundant and unnecessary to me.
@wojtek-t do you perhaps have any opinion on that?
This information should already be exposed today
For my own clarification - could you explain how a cluster operator can get those metrics via the k8s API?
so having another set of metrics in the kube-apiserver feels redundant and unnecessary to me
Measuring a dependency from the caller side is more robust than doing it from the callee side. For example, let's say the aggregated apiserver is down: we can't measure the failure counts without caller-side metrics. Similarly, if the aggregated apiserver is fast to serve the response but there's a network delay while sending the request from the kube-apiserver to it, that won't be caught by the callee-side latency metric.
For my own clarification - could you explain how a cluster operator can get those metrics via the k8s API?
They would need to scrape the aggregated apiserver metrics endpoint. If I take metrics-server as an example, the SLI metrics are available on its /metrics endpoint.
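For illustration only (this is not from the thread): a minimal sketch of how an operator could pull metrics-server's /metrics through the API server's service proxy with client-go. The namespace, Service name, and port scheme are assumptions matching the default metrics-server deployment, and this only works if RBAC permits the services/proxy subresource and metrics-server's delegated authorizer allows the caller to read /metrics.

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: running outside the cluster with the default kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Fetch /metrics from the metrics-server Service via the API server's service proxy.
	// Assumes the default "metrics-server" Service in "kube-system", serving HTTPS.
	raw, err := client.CoreV1().
		Services("kube-system").
		ProxyGet("https", "metrics-server", "", "metrics", nil).
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```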
Measuring a dependency from the caller side is more robust than doing it from the callee side. For example, let's say the aggregated apiserver is down: we can't measure the failure counts without caller-side metrics. Similarly, if the aggregated apiserver is fast to serve the response but there's a network delay while sending the request from the kube-apiserver to it, that won't be caught by the callee-side latency metric.
We already have the main request/latency metrics from the kube-apiserver to capture the latency and the errors in an e2e fashion. Taking the example of metrics-server again, it exposes two resources, PodMetrics and NodeMetrics. You will be able to get the e2e latency from the apiserver_request_duration_seconds metric by setting the resource/group labels accordingly.
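To make the label filtering concrete, here is a hedged sketch (the Prometheus address and the exact label values are assumptions) that queries the e2e p99 latency of NodeMetrics requests from the kube-apiserver's apiserver_request_duration_seconds histogram via the Prometheus Go client:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumption: Prometheus is reachable here and scrapes the kube-apiserver.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// p99 end-to-end latency of NodeMetrics requests as seen by the kube-apiserver.
	query := `histogram_quantile(0.99,
	  sum by (le) (rate(apiserver_request_duration_seconds_bucket{
	    group="metrics.k8s.io", resource="nodes"}[5m])))`

	result, warnings, err := promAPI.Query(context.TODO(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```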
The kube-apiserver e2e metric is good for capturing end-user experience, but the problem is it's not granular enough to catch dependency issues (required by cluster operators). This is how we arrived at having a dedicated set of metrics for dependency calls - https://github.com/kubernetes/kubernetes/pull/116420#issuecomment-1500645585. And the reason why I think we should measure that from the caller side is this - https://github.com/kubernetes/kubernetes/issues/117167#issuecomment-1509075983.
Fwiw - the above pattern is similar to what we're already doing for webhooks today.
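To illustrate the caller-side pattern being discussed, here is a minimal sketch with hypothetical metric names (not the metrics that were actually merged) showing how the kube-apiserver could record latency and errors around each call it proxies to an aggregated apiserver:

```go
package proxymetrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical caller-side metrics, modeled on the existing webhook metrics pattern.
var (
	aggregatedAPICallLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "apiserver_aggregated_api_call_duration_seconds", // hypothetical name
			Help:    "Caller-side latency of requests proxied to aggregated apiservers.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"apiservice"},
	)
	aggregatedAPICallErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "apiserver_aggregated_api_call_errors_total", // hypothetical name
			Help: "Caller-side errors for requests proxied to aggregated apiservers.",
		},
		[]string{"apiservice"},
	)
)

func init() {
	prometheus.MustRegister(aggregatedAPICallLatency, aggregatedAPICallErrors)
}

// callAggregatedAPIServer observes latency and errors from the caller's point of view.
func callAggregatedAPIServer(client *http.Client, apiService string, req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := client.Do(req)
	aggregatedAPICallLatency.WithLabelValues(apiService).Observe(time.Since(start).Seconds())
	if err != nil || resp.StatusCode >= 500 {
		aggregatedAPICallErrors.WithLabelValues(apiService).Inc()
	}
	return resp, err
}
```

Because the observation happens in the caller, an unreachable backend or a network delay between the two apiservers still shows up in these metrics, which is exactly the gap callee-side metrics leave.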
The apiserver should have these metrics too, not just the aggregated apiservers, whose metrics may not be easily obtainable by platform operators. Additionally, having both lets you compare them and see what the network cost is.
Measuring a dependency from the caller side is more robust than doing it from the callee side.
That's a good point that I missed earlier. It is also the strategy that client-go uses today. It sounds good to me, thanks for the clarification.
The apiserver should have these metrics too, not just the aggregated apiservers, whose metrics may not be easily obtainable by platform operators. Additionally, having both lets you compare them and see what the network cost is.
+1
/cc @logicalhan @cici37
/sig instrumentation
@my-git9 @hysyeah - do you still plan to work on the CRD and extension apiserver metrics? If so, feel free to ask any questions you may have. If not, please unassign yourself to open it up for other takers.
@shyamjvs: Did #117211 really finish this? I see several boxes still not checked.
Yeah that PR shouldn't have closed this issue. We still have CRD and extension apiserver metrics pending.
/reopen
@shyamjvs: Reopened this issue.
/assign
I can work on the Extension apiserver section.
/triage accepted
/help
/unassign @my-git9
Since I haven't heard back from them in a month, opening up the CRD conversion webhook task for any takers.
/assign
I can work on the CRD Conversion webhook.
/assign @s-urbaniak for sig-auth review of the already merged PRs
The CRD conversion webhook metrics have also been added. Thanks @cchapla!
@rayowang - do you still plan to work on the extensions apiserver metrics? If not, please unassign yourself to open it up for other takers.
Sorry, I've been a little busy lately. I'll finish it this week.
@shyamjvs Hello, I submitted a PR a few days ago, but the GitHub Actions job won't run. Could you add the ok-to-test label and review it? Thank you.
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
/triage accepted (org members only)
/close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
What would you like to be added?
Following up from this discussion - https://github.com/kubernetes/kubernetes/pull/116420#issuecomment-1500602135
There are different components that fall in the critical path for the k8s API today (apiserver (core), authn/authz webhooks, mutating/validating/conversion webhooks, etcd, extension apiservers - maybe more I'm missing). While some of those do, a bunch of them don't seem to have metrics tracking request/error counts and latency. Here's what I found so far (will update as we learn more):
Finally, wrt the apiserver itself, we measure request/error counts and these flavors of latency metrics today:
Why is this needed?
Metrics at component/dependency level allow us to:
/sig api-machinery
/sig auth
/sig scalability
/kind feature
/help