kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Monitoring gaps in apiserver extension mechanisms #117167

Open shyamjvs opened 1 year ago

shyamjvs commented 1 year ago

What would you like to be added?

Following up from this discussion - https://github.com/kubernetes/kubernetes/pull/116420#issuecomment-1500602135

There are different components that fall on the critical path for the k8s API today (apiserver (core), authn/authz webhooks, mutating/validating/conversion webhooks, etcd, extension apiservers - maybe more I'm missing). While some of these have metrics tracking request/error counts and latencies, a bunch of them don't seem to. Here's what I found so far (will update as we learn more):

- [x] [Authentication webhook](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/endpoints/filters/authentication.go#L47)
  - [x] request/error counts ([xref](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/endpoints/filters/metrics.go#L53))
  - [x] request latencies ([xref](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/endpoints/filters/metrics.go#L62))
- [ ] [Authorization webhook](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/endpoints/filters/authorization.go#L45)
  - [ ] request/error counts
  - [ ] request latencies
- [x] [Admission webhooks](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/dispatcher.go#L186) (mutating/validating)
  - [x] request/error counts ([xref](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/admission/metrics/metrics.go#L201-L219))
  - [x] request latencies ([xref](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/admission/metrics/metrics.go#L175))
- [ ] [CRD Conversion webhook](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/conversion/webhook_converter.go#L216)
  - [ ] request/error counts
  - [ ] request latencies
- [ ] [Etcd](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L139)
  - [ ] request/error counts
  - [x] request latencies ([xref](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/storage/etcd3/metrics/metrics.go#L36))
- [ ] [Extension apiserver](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_proxy.go#L113)
  - [ ] request/error counts
  - [ ] request latencies

Finally, wrt the apiserver itself, we measure request/error counts and these flavors of latency metrics today:

- [x] [e2e latency](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L91) (capturing customer experience)
- [x] [SLI-based latency](https://github.com/kubernetes/kubernetes/blob/ad18954259eae3db51bac2274ed4ca7304b923c4/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L120) (measuring cloud-provider QoS)
- [ ] We discussed adding a third flavor of latency metric that measures only the apiserver "core" latency (excluding any external callback/webhook mechanisms) [here](https://github.com/kubernetes/kubernetes/pull/116420#issuecomment-1499652447). We don't yet have consensus that the benefits of doing so outweigh the additional metric churn, so we can revisit that later (TBD).

Why is this needed?

Metrics at component/dependency level allow us to:

- track each component's performance in isolation
- set internal (non-customer-facing) SLOs for teams owning those components
- narrow down the root cause of API latencies easily

A rough sketch of what such a caller-side dependency metric could look like is included below.
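For illustration only, here is a minimal sketch of a pair of caller-side dependency metrics, modeled on the pattern of the existing authentication-webhook metrics and using `k8s.io/component-base/metrics`. The metric names, labels, and the `RecordDependencyRequest` helper are hypothetical, not something that exists in the tree:

```go
package dependencymetrics

import (
	"time"

	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var (
	// Hypothetical caller-side metrics for calls the kube-apiserver makes to one of
	// its dependencies (webhook, etcd, aggregated apiserver, ...).
	dependencyRequestTotal = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Name:           "apiserver_dependency_request_total",
			Help:           "Number of requests the apiserver sent to a dependency, by outcome.",
			StabilityLevel: metrics.ALPHA,
		},
		[]string{"dependency", "code"},
	)
	dependencyRequestDuration = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Name:           "apiserver_dependency_request_duration_seconds",
			Help:           "Caller-side latency of requests the apiserver sent to a dependency.",
			Buckets:        metrics.ExponentialBuckets(0.001, 2, 15), // 1ms .. ~16s
			StabilityLevel: metrics.ALPHA,
		},
		[]string{"dependency", "code"},
	)
)

func init() {
	legacyregistry.MustRegister(dependencyRequestTotal)
	legacyregistry.MustRegister(dependencyRequestDuration)
}

// RecordDependencyRequest observes a single call from the caller's side, so that
// errors and network delay are visible even when the callee never responds.
func RecordDependencyRequest(dependency, code string, elapsed time.Duration) {
	dependencyRequestTotal.WithLabelValues(dependency, code).Inc()
	dependencyRequestDuration.WithLabelValues(dependency, code).Observe(elapsed.Seconds())
}
```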

/sig api-machinery
/sig auth
/sig scalability
/kind feature
/help

k8s-ci-robot commented 1 year ago

@shyamjvs: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes/kubernetes/issues/117167):

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
shyamjvs commented 1 year ago

cc @wojtek-t @lavalamp

Also please correct me if I misread any of the code.

HirazawaUi commented 1 year ago

/assign

I think I can handle the Authorization webhook section.

my-git9 commented 1 year ago

/assign

I want to work on the CRD Conversion webhook section.

hysyeah commented 1 year ago

/assign

I will work on the Extension apiserver section.

iyear commented 1 year ago

/assign

I'm willing to work on the Etcd request/error counts section.

dgrisonnet commented 1 year ago

@shyamjvs what is the gap you are referring to with the extension apiserver? As far as I am aware, aggregated apiservers in general should implement the same generic interface as the kube-apiserver and, as such, should expose the same request count/latency metrics as the kube-apiserver.

shyamjvs commented 1 year ago

@dgrisonnet IIUC the main apiserver isn't exposing a metric showing latency/error counts for calls made to the aggregated apiservers though (like we do today for webhooks). As a result, monitoring those dependencies centrally via the /metrics API isn't possible today.

dgrisonnet commented 1 year ago

I am a bit on the fence about that one. I can see your point, but as far as I am aware the situation is a bit different. Aggregated apiservers should be written based on the generic apiserver library, which already provides request counters and latency metrics. This information should already be exposed today, so having another set of metrics in the kube-apiserver feels redundant and unnecessary to me.

@wojtek-t do you perhaps have any opinion on that?

shyamjvs commented 1 year ago

> This information should already be exposed today

Just for my clarity - could you explain how a cluster operator can get those metrics via the k8s API?

> so having another set of metrics in the kube-apiserver feels redundant and unnecessary to me

Measuring a dependency from the caller side is more robust than doing it from the callee side. For example, if the aggregated apiserver is down, we can't measure the failure counts without caller-side metrics. Similarly, if the aggregated apiserver is fast to serve the response but there's a network delay while sending the request from the kube-apiserver to it, that won't be caught by the callee-side latency metric.
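To make the caller-side measurement concrete, here is a hedged sketch of how the kube-aggregator's proxy transport could be wrapped so that latency and failures are recorded on the kube-apiserver side. The type, field names, and metric wiring are illustrative assumptions, not the actual handler_proxy.go code:

```go
package dependencymetrics

import (
	"net/http"
	"strconv"
	"time"

	"k8s.io/component-base/metrics"
)

// callerSideTransport is a hypothetical wrapper around the RoundTripper the
// kube-apiserver uses to proxy requests to an aggregated apiserver. Because the
// clock starts and stops on the caller's side, a down apiservice shows up as an
// error count and network delay shows up in the latency histogram, neither of
// which the callee's own /metrics endpoint can capture.
type callerSideTransport struct {
	delegate   http.RoundTripper
	apiService string                // e.g. "v1beta1.metrics.k8s.io"
	latency    *metrics.HistogramVec // labelled by apiservice and HTTP code
}

func (t *callerSideTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.delegate.RoundTrip(req)

	code := "<error>" // the callee never produced a status code
	if err == nil {
		code = strconv.Itoa(resp.StatusCode)
	}
	t.latency.WithLabelValues(t.apiService, code).Observe(time.Since(start).Seconds())
	return resp, err
}
```

Exposing such a histogram from the kube-apiserver's own /metrics would make aggregated-apiserver health observable from the same scrape target as the existing webhook metrics.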

dgrisonnet commented 1 year ago

> Just for my clarity - could you explain how a cluster operator can get those metrics via the k8s API?

They would need to scrape the aggregated apiserver metrics endpoint. If I take metrics-server as an example, the SLI metrics are available on its /metrics endpoint.
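For instance, a minimal client-go sketch of pulling metrics-server's own /metrics endpoint through the kube-apiserver's service proxy; the "kube-system"/"metrics-server" Service name and "https" port are assumptions about a default deployment, and whether the proxy subresource and /metrics are reachable still depends on RBAC and the metrics-server flags in use:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// GET /api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy/metrics
	raw, err := clientset.CoreV1().
		Services("kube-system").
		ProxyGet("https", "metrics-server", "https", "metrics", nil).
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```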

> Measuring a dependency from the caller side is more robust than doing it from the callee side. For example, if the aggregated apiserver is down, we can't measure the failure counts without caller-side metrics. Similarly, if the aggregated apiserver is fast to serve the response but there's a network delay while sending the request from the kube-apiserver to it, that won't be caught by the callee-side latency metric.

We already have the main request/latency metrics from the kube-apiserver to capture the latency and the errors in an e2e fashion. Taking the example of metrics-server again, it exposes two resources, PodMetrics and NodeMetrics. You will be able to get the e2e latency from the apiserver_request_duration_seconds metric by setting the resource/group labels accordingly.
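As a concrete (and hedged) example of slicing that e2e metric, the following queries the p99 latency for the metrics.k8s.io group from Prometheus using the Go client; the Prometheus address, quantile, and label grouping are assumptions about a particular monitoring setup rather than anything this thread prescribes:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed in-cluster Prometheus address; adjust for your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring.svc:9090"})
	if err != nil {
		panic(err)
	}

	// p99 e2e latency of kube-apiserver requests in the metrics.k8s.io group
	// (i.e. requests proxied to metrics-server), per verb and resource.
	query := `histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{group="metrics.k8s.io"}[5m]))
  by (le, verb, resource))`

	result, warnings, err := promv1.NewAPI(client).Query(context.TODO(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```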

shyamjvs commented 1 year ago

The kube-apiserver e2e metric is good for capturing end-user experience, but the problem is that it's not granular enough to catch dependency issues (which cluster operators need). This is how we arrived at having a dedicated set of metrics for dependency calls - https://github.com/kubernetes/kubernetes/pull/116420#issuecomment-1500645585. And the reason why I think we should measure that from the caller side is this - https://github.com/kubernetes/kubernetes/issues/117167#issuecomment-1509075983.

Fwiw - the above pattern is similar to what we're already doing for webhooks today.

lavalamp commented 1 year ago

The apiserver should have metrics, and not just the aggregated apiservers, whose metrics may not be easily gettable by platform operators. Additionally, having both lets you compare and see what the network cost is.

dgrisonnet commented 1 year ago

> Measuring a dependency from the caller side is more robust than doing it from the callee side.

That's a good point that I missed earlier. It is also the strategy that client-go has today. It sounds good to me, thanks for the clarification.

wojtek-t commented 1 year ago

> The apiserver should have metrics, and not just the aggregated apiservers, whose metrics may not be easily gettable by platform operators. Additionally, having both lets you compare and see what the network cost is.

+1

fedebongio commented 1 year ago

/cc @logicalhan @cici37
/sig instrumentation

shyamjvs commented 1 year ago

@my-git9 @hysyeah - do you still plan to work on the CRD and extension apiserver metrics? If so, feel free to ask any questions you may have. If not, please unassign yourself to open it up for other takers.

MikeSpreitzer commented 1 year ago

@shyamjvs : Did #117211 really finish this? I see several boxes still not checked.

shyamjvs commented 1 year ago

Yeah that PR shouldn't have closed this issue. We still have CRD and extension apiserver metrics pending.

/reopen

k8s-ci-robot commented 1 year ago

@shyamjvs: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/117167#issuecomment-1536906670):

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
rayowang commented 1 year ago

/assign

I can work on the Extension apiserver section.

fedebongio commented 1 year ago

/triage accepted
/help

shyamjvs commented 1 year ago

/unassign @my-git9

Since I haven't heard back from them in a month, opening up the CRD conversion webhook task for any takers.

cchapla commented 1 year ago

/assign

I can work on the CRD Conversion webhook.

stlaz commented 1 year ago

/assign @s-urbaniak for sig-auth review of the already merged PRs

shyamjvs commented 1 year ago

The CRD conversion webhook metrics have also been added. Thanks @cchapla!

@rayowang - do you still plan to work on the extension apiserver metrics? If not, please unassign yourself to open it up for other takers.

rayowang commented 1 year ago

> The CRD conversion webhook metrics have also been added. Thanks @cchapla!
>
> @rayowang - do you still plan to work on the extension apiserver metrics? If not, please unassign yourself to open it up for other takers.

Sorry, I've been a little busy lately, I'll finish it this week.

rayowang commented 1 year ago

@shyamjvs Hello, I submitted a PR a few days ago, but the GitHub Actions job can't run yet. Could you help by adding the ok-to-test label and reviewing it when you get a chance? Thank you.

k8s-triage-robot commented 1 month ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Close this issue with `/close`

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

HirazawaUi commented 1 month ago

/triage accepted