fix: fallback to `gauge` for `protofmt`-based negotiations

rexagod commented 9 months ago

What this PR does / why we need it: Fallback to gauge metric type when the negotiated content-type is protofmt-based, since Prometheus' protobuf machinery does not recognize all OpenMetrics types (info and statesets in this context).

How does this change affect the cardinality of KSM: None.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #2248

rexagod commented 9 months ago

Need to test this.

rexagod commented 9 months ago

Verified using the following configurations. Error logs were reported without this patch, but none when this was applied.

Details

### Prometheus error log ``` ts=2023-12-13T16:01:02.547Z caller=scrape.go:1418 level=debug component="scrape manager" scrape_pool=kube-state-metrics target=http://127.0.0.1:8080/metrics msg="Append failed" err="invalid metric type \"info\"" ``` ### KSM CRS Configuration ``` ... metrics: - name: "foo_info" help: "foo_help" each: type: Info info: path: [metadata] labelsFromPath: "name": [name] ``` ### Prometheus Configuration ``` global: evaluation_interval: 15s scrape_interval: 15s scrape_protocols: [PrometheusProto] scrape_configs: - job_name: 'kube-state-metrics' static_configs: - targets: ['127.0.0.1:8080'] ``` ### Generated CRS Metric (from Prometheus UI) ``` kube_customresource_foo_info{customresource_group="contoso.com", customresource_kind="MyPlatform", customresource_version="v1alpha1", instance="127.0.0.1:8080", job="kube-state-metrics", name="test-dotnet-app"} 1 ```

logicalhan commented 9 months ago

/triage accepted /assign @dgrisonnet

dgrisonnet commented 9 months ago

How about a different approach.

The current storage is built based on the assumption that the returned types never changes (which used to hold true until recently). Now that we support both the Prometheus text format and the OpenMetrics format, this isn't valid anymore.

The current approach you have is expensive because we do a search in the header of each metric to replace the type.

How about we just don't add it to the generated header stored in the cache anymore (from https://github.com/kubernetes/kube-state-metrics/blob/0bc4b19ac5cf7039fa877a43a85646367f3d63a4/pkg/metric_generator/generator.go#L93), and rather add it close to when the format is negotiated? Maybe in a new function before SanitizeHeader. We would then just need to surface the metrics family type there somehow and then make a decision on the type added to the header depending on the format that was negociated.

rexagod commented 9 months ago

A couple of notes around the latest patch.

The earlier approach used strings.Has* and strings.LastIndex functions, with respective complexities of O(1) and O(n). This patch removes the dependency on the latter so the whole operation is lazily run in constant time.
Manipulate headers only when they are not duplicates (since they will be removed before any operations).

Benchmarks after the first optimization (comparing the first commit v/s second one on this PR)

goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                                  │   PUSHED    │               STAGED                │
                                                  │   sec/op    │   sec/op     vs base                │
SanitizeHeaders/text-format_unique_headers-10       561.4µ ± 0%   561.2µ ± 0%        ~ (p=0.811 n=10)
SanitizeHeaders/text-format_duplicate_headers-10    529.1µ ± 0%   281.6µ ± 0%  -46.77% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10      4.047m ± 0%   3.985m ± 0%   -1.54% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10   591.4µ ± 0%   279.1µ ± 0%  -52.81% (p=0.000 n=10)
geomean                                             918.2µ        647.5µ       -29.48%

Benchmarks after both optimizations (comparing the first commit v/s second one on this PR)

goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                                  │    PUSHED    │                 STAGED                 │
                                                  │    sec/op    │    sec/op      vs base                 │
SanitizeHeaders/text-format_unique_headers-10       561.39µ ± 0%    31.79µ ±  0%   -94.34% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10    529.06µ ± 0%    31.82µ ±  0%   -93.98% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10       4.047m ± 0%   34.068m ± 65%  +741.84% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10   591.35µ ± 0%    96.74µ ± 33%   -83.64% (p=0.000 n=10)
geomean                                              918.2µ         240.3µ         -73.83%

How about we just don't add it to the generated header [...]

I'm doing something similar here, where instead of removing the multi-length type substrings, we append their enum values instead (unit-length). This means we won't need to search for the last " " anymore to trim that off, while still retaining the type data within the header required to convert the header back to the expected representation. In addition to the optimizations introduced, this makes the operation execute in constant-time.

However, either way (removing the type entirely, or appending a known-length suffix) we see that the tests need to be reformatted dramatically (lots of repeated copy-pasting), so I was wondering I'll hold off on doing that until we have an approach that looks good after reviews (either the one that's currently in this patch, or any other that may be suggested).

EDIT: Additionally, I've removed the extraneous type definitions in the CRS package and made metrics the single source of truth. This should also help avoid encounters between "Info" and "info", etc. No user-facing features were changed.

rexagod commented 9 months ago

Posting baseline benchmarks, forgot to do this earlier.

Baseline (same as main) v/s changes in this PR.

goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                              │   BASELINE   │                   HEAD                    │
                                              │    sec/op    │     sec/op      vs base                   │
SanitizeHeaders/text-format_unique_headers-10       251.62µ ± 0%     31.17µ ±  0%     -87.61% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10    248.52µ ± 4%     31.22µ ±  0%     -87.44% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10       250.5µ ± 0%   34093.4µ ± 49%  +13509.75% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10   247.63µ ± 1%     95.04µ ± 41%     -61.62% (p=0.000 n=10)
geomean                                              249.6µ          237.0µ            -5.05%

main + skip empty headers v/s changes in this PR (main outperforms this PR when protobuf sanitization is the only thing different in this PR).

goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                                  │   BASELINE   │                   HEAD                    │
                                                  │    sec/op    │     sec/op      vs base                   │
SanitizeHeaders/text-format_unique_headers-10       249.42µ ± 0%     31.14µ ±  0%     -87.51% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10     42.73µ ± 0%     31.12µ ±  0%     -27.16% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10       251.1µ ± 0%   36649.4µ ± 64%  +14492.68% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10    42.74µ ± 0%     93.45µ ± 38%    +118.64% (p=0.000 n=10)
geomean                                              103.4µ          240.0µ          +132.09%

To summarize, we optimized the flow with the first one quite a bit (to the point where it negated the 1.3x more impact on SanitizeHeaders, and gave us a 0.05x boost), but if both main and this PR have that in common, with the only feature differentiating this PR being the proto sanitization, that impacts the performance compared to the baseline.

Please note that these are the stats compared to the baseline, and not the regex implementation (that's present here).

rexagod commented 9 months ago

Added a couple of O(1) checks from the first commit back in. The benchmarks may differ slightly, but the overall inference should be the same.

mrueg commented 9 months ago

Thanks for running those benchmarks, this is super helpful to get confidence.

rexagod commented 9 months ago

Good for reviews.

rexagod commented 9 months ago

/hold

Need to add a test too, will push shortly.

rexagod commented 9 months ago

/unhold

mrueg commented 9 months ago

/hold (for @rexagod to unhold)

/lgtm

Changeset looks good, did you test this with your local setup that scrapes with protobuf?

rexagod commented 9 months ago

Pasting the requested information below.

Testing locally on a Prometheus Protobuf setup.

Benchmarking this PR against `main`.

``` goos: darwin goarch: arm64 pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store │ BASELINE │ HEAD │ │ sec/op │ sec/op vs base │ SanitizeHeaders/text-format_unique_headers-10 255.0µ ± 3% 1225.8µ ± 3% +380.70% (p=0.000 n=10) SanitizeHeaders/text-format_duplicate_headers-10 258262.50n ± 4% 10.87n ± 0% -100.00% (p=0.000 n=10) SanitizeHeaders/proto-format_unique_headers-10 2.071m ± 0% SanitizeHeaders/proto-format_duplicate_headers-10 18.50n ± 1% geomean 256.6µ 4.754µ -98.58% ```

rexagod commented 9 months ago

/cc @dgrisonnet

(for another review before merging)

rexagod commented 8 months ago

@dgrisonnet I believe all concerns have been addressed, can you PTAL once more (and LGTM if this looks good to you)?

schahal commented 8 months ago

Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?

dgrisonnet commented 7 months ago

Yes @schahal

dgrisonnet commented 7 months ago

@rexagod didn't we discuss potentially removing OpenMetrics format support?

rexagod commented 7 months ago

@dgrisonnet Yes, I believe we talked about dropping metric-standard lock-ins, and implementing a one-fits-all mechanism (probably on top of gauges) that allows us to generate an exposition format that is ingestible by most metric backends, if not all.

rexagod commented 7 months ago

(friendly ping @mrueg @dgrisonnet to /lgtm)

remram44 commented 6 months ago

2.11.0 introduced a regression, as metrics include stateset entries which are refused by Prometheus. This worked correctly on 2.10 so I'm stuck until this MR is merged.

dgrisonnet commented 6 months ago

/lgtm /approve

k8s-ci-robot commented 6 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, mrueg, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubernetes/kube-state-metrics/blob/main/OWNERS)~~ [dgrisonnet,mrueg,rexagod] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

dgrisonnet commented 6 months ago

/unhold

Efrat19 commented 6 months ago

Hey @rexagod ! is this also meant to fix #1973 replicating in 2.110??

rexagod commented 6 months ago

It should, yes.

LaikaN57 commented 6 months ago

Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?

@dgrisonnet Looks like this fix got merged in too late for v2.11.0. How do we feel about cutting a new release soon™ to get people unblocked? (Thanks in advance.)

LaikaN57 commented 6 months ago

Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?

@dgrisonnet Looks like this fix got merged in too late for v2.11.0. How do we feel about cutting a new release soon™ to get people unblocked? (Thanks in advance.)

@rexagod bump ^

rexagod commented 6 months ago

I'll put https://github.com/kubernetes/kube-state-metrics/pull/2335 up for review today.

rexagod commented 6 months ago

Sharing here for visibility: https://github.com/kubernetes/kube-state-metrics/pull/2335#issuecomment-2032148989.

kubernetes / kube-state-metrics

fix: fallback to `gauge` for `protofmt`-based negotiations #2270

Benchmarks after the first optimization (comparing the first commit v/s second one on this PR)

Benchmarks after both optimizations (comparing the first commit v/s second one on this PR)