Closed rexagod closed 6 months ago
Need to test this.
Verified using the following configurations. Error logs were reported without this patch, but none when this was applied.
/triage accepted /assign @dgrisonnet
How about a different approach.
The current storage is built based on the assumption that the returned types never changes (which used to hold true until recently). Now that we support both the Prometheus text format and the OpenMetrics format, this isn't valid anymore.
The current approach you have is expensive because we do a search in the header of each metric to replace the type.
How about we just don't add it to the generated header stored in the cache anymore (from https://github.com/kubernetes/kube-state-metrics/blob/0bc4b19ac5cf7039fa877a43a85646367f3d63a4/pkg/metric_generator/generator.go#L93), and rather add it close to when the format is negotiated? Maybe in a new function before SanitizeHeader. We would then just need to surface the metrics family type there somehow and then make a decision on the type added to the header depending on the format that was negociated.
A couple of notes around the latest patch.
strings.Has*
and strings.LastIndex
functions, with respective complexities of O(1) and O(n). This patch removes the dependency on the latter so the whole operation is lazily run in constant time.goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
│ PUSHED │ STAGED │
│ sec/op │ sec/op vs base │
SanitizeHeaders/text-format_unique_headers-10 561.4µ ± 0% 561.2µ ± 0% ~ (p=0.811 n=10)
SanitizeHeaders/text-format_duplicate_headers-10 529.1µ ± 0% 281.6µ ± 0% -46.77% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10 4.047m ± 0% 3.985m ± 0% -1.54% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10 591.4µ ± 0% 279.1µ ± 0% -52.81% (p=0.000 n=10)
geomean 918.2µ 647.5µ -29.48%
goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
│ PUSHED │ STAGED │
│ sec/op │ sec/op vs base │
SanitizeHeaders/text-format_unique_headers-10 561.39µ ± 0% 31.79µ ± 0% -94.34% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10 529.06µ ± 0% 31.82µ ± 0% -93.98% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10 4.047m ± 0% 34.068m ± 65% +741.84% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10 591.35µ ± 0% 96.74µ ± 33% -83.64% (p=0.000 n=10)
geomean 918.2µ 240.3µ -73.83%
How about we just don't add it to the generated header [...]
I'm doing something similar here, where instead of removing the multi-length type substrings, we append their enum values instead (unit-length). This means we won't need to search for the last " " anymore to trim that off, while still retaining the type data within the header required to convert the header back to the expected representation. In addition to the optimizations introduced, this makes the operation execute in constant-time.
However, either way (removing the type entirely, or appending a known-length suffix) we see that the tests need to be reformatted dramatically (lots of repeated copy-pasting), so I was wondering I'll hold off on doing that until we have an approach that looks good after reviews (either the one that's currently in this patch, or any other that may be suggested).
EDIT: Additionally, I've removed the extraneous type definitions in the CRS package and made metrics
the single source of truth. This should also help avoid encounters between "Info" and "info", etc. No user-facing features were changed.
Posting baseline benchmarks, forgot to do this earlier.
Baseline (same as main
) v/s changes in this PR.
goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
│ BASELINE │ HEAD │
│ sec/op │ sec/op vs base │
SanitizeHeaders/text-format_unique_headers-10 251.62µ ± 0% 31.17µ ± 0% -87.61% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10 248.52µ ± 4% 31.22µ ± 0% -87.44% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10 250.5µ ± 0% 34093.4µ ± 49% +13509.75% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10 247.63µ ± 1% 95.04µ ± 41% -61.62% (p=0.000 n=10)
geomean 249.6µ 237.0µ -5.05%
main
+ skip empty headers v/s changes in this PR (main
outperforms this PR when protobuf sanitization is the only thing different in this PR).
goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
│ BASELINE │ HEAD │
│ sec/op │ sec/op vs base │
SanitizeHeaders/text-format_unique_headers-10 249.42µ ± 0% 31.14µ ± 0% -87.51% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10 42.73µ ± 0% 31.12µ ± 0% -27.16% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10 251.1µ ± 0% 36649.4µ ± 64% +14492.68% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10 42.74µ ± 0% 93.45µ ± 38% +118.64% (p=0.000 n=10)
geomean 103.4µ 240.0µ +132.09%
To summarize, we optimized the flow with the first one quite a bit (to the point where it negated the 1.3x more impact on SanitizeHeaders
, and gave us a 0.05x boost), but if both main
and this PR have that in common, with the only feature differentiating this PR being the proto
sanitization, that impacts the performance compared to the baseline.
Please note that these are the stats compared to the baseline, and not the regex
implementation (that's present here).
Added a couple of O(1) checks from the first commit back in. The benchmarks may differ slightly, but the overall inference should be the same.
Thanks for running those benchmarks, this is super helpful to get confidence.
Good for reviews.
/hold
Need to add a test too, will push shortly.
/unhold
/hold (for @rexagod to unhold)
/lgtm
Changeset looks good, did you test this with your local setup that scrapes with protobuf?
Pasting the requested information below.
/cc @dgrisonnet
(for another review before merging)
@dgrisonnet I believe all concerns have been addressed, can you PTAL once more (and LGTM if this looks good to you)?
Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?
Yes @schahal
@rexagod didn't we discuss potentially removing OpenMetrics format support?
@dgrisonnet Yes, I believe we talked about dropping metric-standard lock-ins, and implementing a one-fits-all mechanism (probably on top of gauges) that allows us to generate an exposition format that is ingestible by most metric backends, if not all.
(friendly ping @mrueg @dgrisonnet to /lgtm)
2.11.0 introduced a regression, as metrics include stateset
entries which are refused by Prometheus. This worked correctly on 2.10 so I'm stuck until this MR is merged.
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dgrisonnet, mrueg, rexagod
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/unhold
Hey @rexagod ! is this also meant to fix #1973 replicating in 2.110??
It should, yes.
Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?
@dgrisonnet Looks like this fix got merged in too late for v2.11.0. How do we feel about cutting a new release soon™ to get people unblocked? (Thanks in advance.)
Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?
@dgrisonnet Looks like this fix got merged in too late for v2.11.0. How do we feel about cutting a new release soon™ to get people unblocked? (Thanks in advance.)
@rexagod bump ^
I'll put https://github.com/kubernetes/kube-state-metrics/pull/2335 up for review today.
Sharing here for visibility: https://github.com/kubernetes/kube-state-metrics/pull/2335#issuecomment-2032148989.
What this PR does / why we need it: Fallback to
gauge
metric type when the negotiated content-type isprotofmt
-based, since Prometheus' protobuf machinery does not recognize all OpenMetrics types (info
andstatesets
in this context).How does this change affect the cardinality of KSM: None.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged): Fixes #2248