kubernetes-sigs / metrics-server

Scalable and efficient source of container resource metrics for Kubernetes built-in autoscaling pipelines.
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
Apache License 2.0
5.63k stars 1.85k forks source link

Wrap helm chart servicemonitor in helm capability check #1481

Closed mjnagel closed 2 months ago

mjnagel commented 2 months ago

What would you like to be added:

Currently the servicemonitor in the chart is conditional on two enabled values. I think it would help to add a helm capabilities check be added to validate that the CRD for servicemonitor is present.

Why is this needed:

The current conditionals are helpful, but when deploying the same config across dev/test/staging/prod environment I may have a subset of applications on my cluster. This might not always include monitoring/prometheus so it is useful to be able to enable metrics but not have the helm install fail. This appears to be a common pattern in a lot of helm charts leveraging prometheus CRs.

/kind feature

k8s-ci-robot commented 2 months ago

This issue is currently awaiting triage.

If metrics-server contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
stevehipwell commented 2 months ago

@mjnagel if you set the serviceMonitor.enabled value to true on a cluster without the ServiceMonitor CRD installed the expected behaviour is to error and fail the apply. Helm is a promise based system and if it can't keep a promise it should error rather than silently break the promise.

The use of capabilities, especially non core APIs, should be used sparingly as they can behave incorrectly in systems where the target Kubernetes API isn't connected to Helm. So this is the reason why the Helm idiom isn't to use capabilities paired with the fail function around resources where the API might not be present, and instead fail at apply time.

The TL;DR here is that the cluster operator should know the spec of the cluster and set the Helm values accordingly.

mjnagel commented 2 months ago

@stevehipwell I think that's totally fair, although I've definitely seen both patterns followed (fail vs "safety net") depending on the helm chart. I've seen this specific capability check (servicemonitor) on a number of other charts I use (fluentbit and loki are two that jump to mind), although there are definitely others I use (to include metrics-server) that don't have it. Totally understand if the perspective from the maintainers of this chart is to error rather than silently resolve - there are certainly other ways to accomplish similar behavior for handling different environments (layered values files, etc).

Just to throw two other options out there and see if there is still a route to make this work:

  1. Change the conditional to {{- if and (or .Values.serviceMonitor.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1")) .Values.metrics.enabled -}} . It's more complex templating to be sure, but would still allow/force failure if someone explicitly enables the servicemonitor and doesn't have the capability. But it would also work for my use case where basically all I want is the servicemonitor to auto-enable based on the cluster.
  2. Basically the same idea as ^ but wrap the "autoEnable" behind another helm value. I've seen this done on other charts I think, the default behavior would not traverse the capability check but there would be a route to enable that check. Again - more template complexity + an extra helm value, but this would be the least "disruptive" option for end users (guaranteed to not change any behavior unless they explicitly use the new value).
stevehipwell commented 2 months ago

@mjnagel I don't see any benefit to changing the current conditional. It checks that you want metrics enabled and that you also want to scrape them with a ServiceMonitor before running the template.

FYI I'd have removed the capability check from the Fluent Bit chart if I wasn't working on replacement charts.

mjnagel commented 2 months ago

Thanks for the dialogue - really just comes down to making it more dynamic for me vs doing explicitly what is asked in the values. Totally see the perspective of treating values as a promise!