Expose metrics in protobuf format

invidian commented 3 years ago

What would you like to be added:

kube-apiserver and kubelet (and probably other core Kubernetes components) support scraping Prometheus metrics in protobuf format:

> GET /metrics HTTP/2
> Host: 10.0.0.12:10250
> user-agent: curl/7.78.0
> accept: application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited;q=0.7,text/plain;version=0.0.4;q=0.3
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [1841 bytes data]
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
} [5 bytes data]
< HTTP/2 200
< content-type: application/vnd.google.protobuf; proto=io.prometheus.client.MetricFamily; encoding=delimited
< date: Mon, 11 Oct 2021 08:08:42 GMT
<
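
For context, `encoding=delimited` in the content type above means the body is a stream of serialized `MetricFamily` messages, each preceded by its length as a varint. The following is a minimal sketch of that framing only, using opaque byte slices in place of real protobuf messages; the function names `writeDelimited`, `readDelimited` and `roundTrip` are made up for illustration and are not part of any Kubernetes or Prometheus API.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeDelimited writes one frame: a uvarint length followed by the payload.
func writeDelimited(w io.Writer, msg []byte) error {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(msg)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readDelimited reads length-prefixed frames until a clean EOF.
func readDelimited(r io.Reader) ([][]byte, error) {
	br := bufio.NewReader(r)
	var out [][]byte
	for {
		size, err := binary.ReadUvarint(br)
		if err == io.EOF {
			return out, nil // no more frames
		}
		if err != nil {
			return nil, err
		}
		msg := make([]byte, size)
		if _, err := io.ReadFull(br, msg); err != nil {
			return nil, err
		}
		out = append(out, msg)
	}
}

// roundTrip frames the payloads, decodes them again, and reports the
// decoded payloads together with the total number of framed bytes.
func roundTrip(payloads []string) ([]string, int, error) {
	var buf bytes.Buffer
	for _, p := range payloads {
		if err := writeDelimited(&buf, []byte(p)); err != nil {
			return nil, 0, err
		}
	}
	framed := buf.Len()
	frames, err := readDelimited(&buf)
	if err != nil {
		return nil, 0, err
	}
	decoded := make([]string, 0, len(frames))
	for _, f := range frames {
		decoded = append(decoded, string(f))
	}
	return decoded, framed, nil
}

func main() {
	// The payloads are placeholders, not real serialized MetricFamily messages.
	decoded, n, err := roundTrip([]string{"family-1", "family-2"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d frames in %d bytes: %q\n", len(decoded), n, decoded)
}
```

A scraper reading such a response would decode each frame into a `MetricFamily` instead of treating it as a string.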

KSM only responds in the plain-text format though, even when the client prefers protobuf:

09:24:13.100367 lo    In  IP6 (flowlabel 0x0b53e, hlim 64, next-header TCP (6) payload length: 290) ::1.42542 > ::1.8080: Flags [P.], cksum 0x012a (incorrect -> 0xc0de), seq 1:259, ack 1, win 512, options [nop,nop,TS val 963065885 ecr 963065885], length 258: HTTP, length: 258
        GET /metrics HTTP/1.1
        Host: localhost:8080
        User-Agent: Go-http-client/1.1
        Accept: application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited;q=0.7,text/plain;version=0.0.4;q=0.3
        Accept-Encoding: gzip
        Connection: close

09:24:13.100379 lo    In  IP6 (flowlabel 0x541a0, hlim 64, next-header TCP (6) payload length: 32) ::1.8080 > ::1.42542: Flags [.], cksum 0x0028 (incorrect -> 0x7144), ack 259, win 510, options [nop,nop,TS val 963065885 ecr 963065885], length 0
09:24:13.225339 lo    In  IP6 (flowlabel 0x541a0, hlim 64, next-header TCP (6) payload length: 32800) ::1.8080 > ::1.42542: Flags [P.], cksum 0x8028 (incorrect -> 0x23cb), seq 1:32769, ack 259, win 512, options [nop,nop,TS val 963066010 ecr 963065885], length 32768: HTTP, length: 32768
        HTTP/1.1 200 OK
        Content-Type: text/plain; version=0.0.4
        Date: Mon, 11 Oct 2021 07:24:13 GMT
        Connection: close
        Transfer-Encoding: chunked
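
The capture above shows the scraper ranking protobuf (q=0.7) above text (q=0.3) and KSM answering with text anyway. To honor the header, a handler would first have to pick the best offered format. The sketch below does this with the standard library only; `negotiate` and the two format constants are hypothetical names, and in real code the `Negotiate` helper from the prometheus/common `expfmt` package would more likely be used.

```go
package main

import (
	"fmt"
	"mime"
	"strconv"
	"strings"
)

const (
	formatText     = "text"
	formatProtobuf = "protobuf"
)

// negotiate returns the known format with the highest q-value in the
// Accept header, falling back to text when nothing matches.
func negotiate(accept string) string {
	best, bestQ := formatText, -1.0
	for _, part := range strings.Split(accept, ",") {
		mediaType, params, err := mime.ParseMediaType(strings.TrimSpace(part))
		if err != nil {
			continue // skip entries that cannot be parsed
		}
		q := 1.0 // default weight when no q parameter is given
		if v, ok := params["q"]; ok {
			if parsed, err := strconv.ParseFloat(v, 64); err == nil {
				q = parsed
			}
		}
		var candidate string
		switch {
		case mediaType == "application/vnd.google.protobuf" &&
			params["proto"] == "io.prometheus.client.MetricFamily" &&
			params["encoding"] == "delimited":
			candidate = formatProtobuf
		case mediaType == "text/plain":
			candidate = formatText
		default:
			continue
		}
		if q > bestQ {
			best, bestQ = candidate, q
		}
	}
	return best
}

func main() {
	// Accept header copied from the request in the capture.
	accept := "application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited;q=0.7,text/plain;version=0.0.4;q=0.3"
	fmt.Println(negotiate(accept))
}
```

Falling back to text keeps the behavior unchanged for clients that send no Accept header at all.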

Protobuf definitely results in less network traffic: in quick testing, the amount of data transferred was 3-7 times smaller with protobuf. I've seen issue #498, but I couldn't find anything related to this issue. Possibly protobuf encoding is also more CPU efficient?

Why is this needed:

To make KSM consume fewer resources and to align it more closely with core Kubernetes components.

Additional context

Discovered as part of work on https://github.com/newrelic/nri-kubernetes/issues/234.

Serializator commented 2 years ago

DISCLAIMER: I do not mean to bash the code or the person who wrote it. This is only meant as "what should be done to allow for an easier path to implementing Protocol Buffers" and not in any way as criticism.

(metricshandler.MetricsHandler).ServeHTTP is responsible for serving HTTP requests for metrics. However, metricshandler.MetricsHandler is responsible not only for writing metrics to the response (to be a bit more specific than "serving HTTP requests") but also for sharding and applying compression (GZIP).

https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/metricshandler/metrics_handler.go

I think this does too much and should be refactored before even trying to implement Protocol Buffers.

Sharding should be moved out of the HTTP handler, so that the HTTP handler is only aware of the metrics that should be written.
GZIP compression should be "decorated" onto the HTTP handler or http.ResponseWriter before or after it is passed into (metricshandler.MetricsHandler).ServeHTTP, thus taking the responsibility of applying compression away from the HTTP handler.
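
As an illustration of the "decorating" idea, here is a minimal sketch of GZIP compression applied as a wrapper around an arbitrary http.Handler, so the wrapped handler only writes metrics. `withGzip`, `gzipResponseWriter`, `metricsHandler` and `scrape` are hypothetical names, not existing KSM code, and the Accept-Encoding check is deliberately simplistic.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// sampleBody is a made-up metric used only for this example.
const sampleBody = "# HELP example_metric A hypothetical metric.\nexample_metric 1\n"

// gzipResponseWriter redirects the body through a gzip.Writer.
type gzipResponseWriter struct {
	http.ResponseWriter
	zw *gzip.Writer
}

func (g gzipResponseWriter) Write(p []byte) (int, error) { return g.zw.Write(p) }

// withGzip compresses the response when the client advertises gzip support.
func withGzip(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next.ServeHTTP(w, r)
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		zw := gzip.NewWriter(w)
		defer zw.Close() // flushes the remaining compressed data
		next.ServeHTTP(gzipResponseWriter{ResponseWriter: w, zw: zw}, r)
	})
}

// metricsHandler stands in for a handler that knows nothing about compression.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	io.WriteString(w, sampleBody)
}

// scrape performs one in-memory request and returns the Content-Encoding
// header and the decoded body.
func scrape(acceptEncoding string) (string, string) {
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if acceptEncoding != "" {
		req.Header.Set("Accept-Encoding", acceptEncoding)
	}
	rec := httptest.NewRecorder()
	withGzip(http.HandlerFunc(metricsHandler)).ServeHTTP(rec, req)

	res := rec.Result()
	defer res.Body.Close()
	encoding := res.Header.Get("Content-Encoding")
	var body io.Reader = res.Body
	if encoding == "gzip" {
		zr, err := gzip.NewReader(res.Body)
		if err != nil {
			return encoding, ""
		}
		defer zr.Close()
		body = zr
	}
	data, _ := io.ReadAll(body)
	return encoding, string(data)
}

func main() {
	encoding, body := scrape("gzip")
	fmt.Printf("Content-Encoding=%q\n%s", encoding, body)
}
```

With this shape, the same wrapper would work unchanged regardless of whether the inner handler emits text or protobuf.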

This goes further in (metricsstore.MetricsWriter).WriteAll, which also makes assumptions about the data format by manually writing \n to the response after the "HELP" line of a metric.

https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/metrics_store/metrics_writer.go#L61-L69

In metricsstore.MetricsStore the metrics are already kept as a slice of byte slices ([][]byte), that is, already rendered. I don't know whether this is a problem for implementing Protocol Buffers or not.

https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/metrics_store/metrics_store.go#L39

//cc @fpetkovski what do you think, both about the specific points I made on the code that should be refactored and about the approach of doing a refactor before implementing Protocol Buffers, to keep the change small and manageable?

A refactor would not only be about separating responsibilities but also about thinking ahead on which abstractions to put in place to make implementing Protocol Buffers (or other formats, for that matter) easier in the future.
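
To make the idea of such an abstraction concrete, here is one possible shape, purely as a sketch: a format-agnostic `Family` type plus an `Encoder` interface, with only a text implementation filled in. All type and function names are hypothetical and do not exist in KSM; a protobuf encoder would be a second implementation of the same interface.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Sample is a simplified sample: labels are pre-rendered for brevity.
type Sample struct {
	Labels string // e.g. `{pod="a"}`
	Value  float64
}

// Family is a format-agnostic description of one metric family.
type Family struct {
	Name, Help, Type string
	Samples          []Sample
}

// Encoder hides the wire format from the code that produces families.
type Encoder interface {
	ContentType() string
	Encode(w io.Writer, f Family) error
}

// textEncoder writes the Prometheus text exposition format.
type textEncoder struct{}

func (textEncoder) ContentType() string { return "text/plain; version=0.0.4" }

func (textEncoder) Encode(w io.Writer, f Family) error {
	// The newlines live here, not in the code that stores metrics.
	if _, err := fmt.Fprintf(w, "# HELP %s %s\n# TYPE %s %s\n", f.Name, f.Help, f.Name, f.Type); err != nil {
		return err
	}
	for _, s := range f.Samples {
		if _, err := fmt.Fprintf(w, "%s%s %g\n", f.Name, s.Labels, s.Value); err != nil {
			return err
		}
	}
	return nil
}

// render encodes all families with the given encoder.
func render(enc Encoder, families []Family) (string, error) {
	var buf bytes.Buffer
	for _, f := range families {
		if err := enc.Encode(&buf, f); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	families := []Family{{
		Name: "example_pods", Help: "A hypothetical metric.", Type: "gauge",
		Samples: []Sample{{Labels: `{pod="a"}`, Value: 1}},
	}}
	out, err := render(textEncoder{}, families)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The trade-off is that families are kept in structured form until encoding time instead of as pre-rendered bytes, which may cost memory and CPU on large clusters.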

Serializator commented 2 years ago

I asked this in Slack as well, but I will ask it here too so that every question and answer is kept within the issue and doesn't get lost 👍🏼

While prototyping a Protocol Buffers implementation in KSM, a simple question came up, possibly with a simple answer.

Why hasn't KSM used the Go client library by Prometheus to implement its metrics? The question arose because the client seems to already support Protocol Buffers.

https://kubernetes.slack.com/archives/CJJ529RUY/p1640638803044600

invidian commented 2 years ago

Thanks for looking into it @Serializator!

> Why hasn't KSM used the Go client library by Prometheus to implement its metrics?

I forgot to mention that in the opening post. I think that would definitely be a preferable solution for this issue!

Serializator commented 2 years ago

From @fpetkovski:

The main reason for this is that KSM dumps a lot of metrics, especially in large clusters, and using the Go client library has proven to be slow and memory intensive in the past.

This might be a useful read to get a bit more context: https://github.com/prometheus/client_golang/discussions/917

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fpetkovski commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

/lifecycle stale

invidian commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

/lifecycle stale

invidian commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

/lifecycle stale

invidian commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

invidian commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

/lifecycle stale

invidian commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 9 months ago

/lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 7 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


kubernetes / kube-state-metrics

Expose metrics in protobuf format #1604