Helm chart for Kubernetes metrics quickstart #562

jaronoff97 commented 1 year ago

Many Prometheus and Kubernetes users are familiar with the kube-prometheus-stack chart, which aims to quickly set up and manage a Prometheus and Grafana installation that collects nearly all of the available Kubernetes metrics. It achieves this using the Prometheus Operator and the ServiceMonitor and PodMonitor custom resources, which configure a user's Prometheus scrape config. We have the ability to do the same using the OpenTelemetry Operator and the Target Allocator. To provide an easy and familiar migration path for existing (or new) Prometheus and Kubernetes users, I created the kube-otel-stack chart, which installs a pre-configured collector and target allocator that dynamically discover ServiceMonitor and PodMonitor custom resources in order to scrape various Kubernetes metrics. You can see below some of the metrics this collector is scraping.

This has since become a requested feature across the OTel Slack from what I can tell, as I've DM'ed this chart to at least 3 different people at this point. I was wondering if it would be welcome for me to clean up this slightly opinionated Helm chart, make it more generic, and donate it to the repository.

austinlparker commented 1 year ago

I've seen O(tens) of requests for this on the OpenTelemetry slack channels. Having it in the community would be great, as we could promote its adoption more widely.

TylerHelmuth commented 1 year ago

I am certainly interested in this if users are interested in this. A couple questions:

  1. @jaronoff97 if this chart was accepted, would you be available as a CodeOwner for the chart?
  2. Is there anything specific to Lightstep that would need to be stripped out, or can the entire chart be taken verbatim?
  3. Is the chart testable via chart-testing?
  4. What has the upkeep of the chart been like? Is it relatively stable (except for operator bumps)?

jaronoff97 commented 1 year ago

Thanks for your questions :)

  1. Yes, happy to be a codeowner for it.
  2. Yes, I would generalize anything that is LS specific in the PR I would make to the repo.
  3. I'm not sure how chart-testing works (never used it before.) I think as long as we could install the operator as part of the testing flow, it should be fine?
  4. Relatively stable, occasionally there's a small change here and there. I'd imagine we'd get some more requests as more people use this, but it shouldn't be changing too drastically.

povilasv commented 1 year ago

I really like this idea, but I have a question: is there a plan to move away from kube-state-metrics, node-exporter, etc. in favour of otel collector native receivers (k8sclusterreceiver and hostmetrics)?

I think in general we should strive to collect all the prometheus metrics from k8s components, but not use any of the Prometheus ecosystem components and use Collector's native features :)

TylerHelmuth commented 1 year ago

@jaronoff97 I'm also curious if your chart handles the installation of the operator and the OpentelemetryCollector object like discussed here:

jcdauchy-moodys commented 1 year ago

I have been using this chart for 3 weeks. It works out of the box, but it will need to be improved (of course). It brings almost the same functionality as "Prometheus Operator with kube-prometheus-stack chart". It is much more lightweight, as you only deploy "agents" to scrape your logs/metrics/traces. I am using it to send metrics to AWS AMP (managed Prometheus).

Here are the main issues I encountered so far:

Thanks for the good work.

jaronoff97 commented 1 year ago

updates/context setting: @TylerHelmuth I still want to donate this if that's still okay. I've validated with a few other people that this would be a great thing for the community to have. The only blocker for this work is to figure out if we can install the operator in the same chart which would make for a better experience. My team is going to be investigating this.

TylerHelmuth commented 1 year ago

@jaronoff97 sounds good. @open-telemetry/helm-approvers please add your thoughts.

Allex1 commented 1 year ago

I approve. Thanks @jaronoff97

dmitryax commented 1 year ago

I don't think I agree that we need another chart for this. I'd rather go with adding the TA option to the collector chart.

Also, why do we promote using Prometheus for scraping kubernetes/kubelet metrics instead of using specialized collector receivers that collect metrics compliant with OTel semantic conventions without additional transformations?

Allex1 commented 1 year ago

I think this would provide a bridge for existing kps users that otherwise would not care to switch (afaik Prometheus is still used in ~ 99.x% of Kubernetes deployments for cluster monitoring). Reusing the existing Prometheus-Operator objects would smooth out that migration.

TylerHelmuth commented 1 year ago

I also see value in a "transition" chart. Long term (like long long term), I think the need for a chart like this diminishes, but for users today who have extensive Prometheus setups but want to try out OTel or start transitioning to OTel, I think this chart fits their needs.

dmitryax commented 1 year ago

Ok, I'm not blocking it. If most @open-telemetry/helm-approvers think it's a good addition, let's add it

dmitryax commented 1 year ago

The name should somehow reflect the Prometheus bridge/transition. kube-otel-stack doesn't seem right to me.

TylerHelmuth commented 1 year ago

Could also be cool to include somewhere how to grab the same telemetry using the collector and its components.

povilasv commented 1 year ago

I'm not sure how this transitioning chart would work? Should we assume that user installed kube-prometheus-stack and we try to somehow migrate it from that to this chart?

I was thinking of having kube-otel-stack initially work like kube-prometheus-stack, collecting metrics using Prometheus, and then slowly refactoring it to use native OpenTelemetry Collector receivers and functionality.

Allex1 commented 1 year ago

> I'm not sure how this transitioning chart would work? Should we assume that user installed kube-prometheus-stack and we try to somehow migrate it from that to this chart?

We should probably assume that the majority of admins scrape their k8s API endpoints with Prometheus via prometheus-operator objects like Service/PodMonitor, which we can reuse with this stack. As such a user, I would initially have both Prometheus and the otel collector scrape this data and compare the results and setup complexity before making any decision.

austinlparker commented 1 year ago

I would also see this as a 'transition' chart, but the migration path to me is something like...

kube-prometheus-stack -> kube-otel-stack -> opentelemetry-operator

In the (admittedly, kinda far?) future, I can see the operator using native OpenTelemetry components and monitoring CRDs to perform the same basic functions as this stack, but in the short-to-medium term, having this in the org will give us a pat answer for "how should I monitor k8s with OpenTelemetry?"

austinlparker commented 1 year ago

Hi, quick bump on this issue - one pretty common piece of feedback we got at KubeCon EU was the number of people who didn't know the operator existed. I believe getting this chart brought in would help a lot with that, as we could then signpost this from the docs as a "how to get started with kubernetes".

TylerHelmuth commented 1 year ago

@dmitryax is there anything else we're waiting on before accepting PRs adding this chart?

jaronoff97 commented 1 year ago

@TylerHelmuth I think this issue is still a blocker. I'm going to run some tests right now to track this down and solve it.

jaronoff97 commented 1 year ago

Okay, after a little mish-moshing of things... I was able to get a single chart that installs cert-manager (a requirement of the operator), the operator, and a collector together. The problem is that it doesn't all install at once, for a few reasons.

Option where we install cert-manager with the chart

TL;DR there are some race conditions and annoyances here

### First installation

In order for the first installation to work for the chart, you need to set the operator's admission webhook to _false_. This is because helm installs resources in a particular order, and if you attempt to install cert-manager and the operator simultaneously with the webhook enabled you get the following error:

```
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Certificate" in version "", unable to recognize "": no matches for kind "Issuer" in version ""]
```

This is fine, because we can just initially disable the webhook on otel-operator installation so the otel-operator can come up healthy _after_ the CRDs for cert-manager are installed.

### Second installation

Now we have to re-enable the webhook, applying that again will get you another fun group of errors.

```
⎨ 11:46:28⎬ ⎨ ⛵️kind-kind : kind-kind⎬ ⎨ ...opentelemetry-helm-charts/charts/kube-otel-stack⎬ ⎨  same-chart-operator-install ✘ ✭⎬
⫸ helm install kube-otel-stack . -f values.yaml
Error: INSTALLATION FAILED: Internal error occurred: failed calling webhook "": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.default.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": dial tcp connect: connection refused
⎨ ✘⎬ ⎨ 11:46:43⎬ ⎨ ⛵️kind-kind : kind-kind⎬ ⎨ ...opentelemetry-helm-charts/charts/kube-otel-stack⎬ ⎨  same-chart-operator-install ✘ ✭⎬
⫸ helm upgrade kube-otel-stack . -f values.yaml
Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "": failed to call webhook: Post "https://kube-otel-stack-cert-manager-webhook.default.svc:443/mutate?timeout=10s": dial tcp connect: connection refused
```

These are due to pods not being ready in order for the webhooks to be called.

### Third installation

After waiting maybe ten seconds, instead of being impatient like me... you are able to successfully install the chart in its entirety

```
⫸ helm upgrade kube-otel-stack . -f values.yaml --install
Release "kube-otel-stack" has been upgraded. Happy Helming!
NAME: kube-otel-stack
LAST DEPLOYED: Mon Apr 24 11:56:05 2023
NAMESPACE: default
STATUS: deployed
REVISION: 3
```

Option where we assume cert-manager is pre-installed

Given most clusters will already have cert-manager installed, here's what the installation process would look like...

A bit smoother, but still the same webhook race condition at the end

### First installation

```
⫸ helm upgrade kube-otel-stack . -f values.yaml -n kube-otel-stack --create-namespace --install
Release "kube-otel-stack" does not exist. Installing it now.
Error: Internal error occurred: failed calling webhook "": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.kube-otel-stack.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": dial tcp connect: connection refused
```

Trying again after a few seconds...

```
⫸ helm upgrade kube-otel-stack . -f values.yaml -n kube-otel-stack --create-namespace --install
Release "kube-otel-stack" has been upgraded. Happy Helming!
NAME: kube-otel-stack
LAST DEPLOYED: Mon Apr 24 12:00:13 2023
NAMESPACE: kube-otel-stack
STATUS: deployed
REVISION: 2
```
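
For anyone who wants to script the "try again after a few seconds" step from the walk-throughs above, a small retry wrapper is one way to do it. This is only a sketch of a workaround, not something the chart ships: the `retry` helper name, the attempt count, and the delay are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): run a command until it
# succeeds or the attempt budget is used up.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  # Re-run the command while it keeps failing.
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 0
}

# Example invocation (commented out so the snippet has no side effects):
# retry 5 10 helm upgrade kube-otel-stack . -f values.yaml --install
```

The delay of 10 seconds mirrors the "maybe ten seconds" wait mentioned above; it only papers over the race while the webhook pods come up and does not remove it.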

Proposed remediations

The operator and collector installed together successfully! An end user using this chart could just as easily enable the mutating webhook post-install as well, but that's not an ideal experience IMO.

I would love to hear thoughts on this, and see if there's anything I missed in my findings here. cc @open-telemetry/helm-maintainers

TylerHelmuth commented 1 year ago

For the cert manager my preference would be to copy whatever pattern kube-prometheus-stack is using. If we can't install the cert manager as part of the chart install that will at least follow our existing pattern for the operator, although there is an issue opened about that friction:

Setting the failurePolicy on the MutatingWebhookConfiguration object to Ignore

When I investigated this a while ago this is the solution I stumbled upon and I believe it is the solution that kube-prometheus-stack uses.
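
To make the failurePolicy idea concrete, here is a rough sketch of flipping the policy on an already-installed webhook with `kubectl patch` and a JSON Patch. It is an illustration against a live cluster, not how either chart implements it (a chart would set the field in its templates). The `failure_policy_patch` helper and the `<name>` placeholder are assumptions; the real MutatingWebhookConfiguration name depends on the release.

```shell
#!/bin/sh
# Hypothetical helper: print a JSON Patch that sets failurePolicy on the
# webhook at the given index of a MutatingWebhookConfiguration.
# The "add" op sets the member whether or not it is already present.
failure_policy_patch() {
  index=$1
  policy=$2
  printf '[{"op":"add","path":"/webhooks/%s/failurePolicy","value":"%s"}]' "$index" "$policy"
}

# Example invocation (commented out; <name> is a placeholder to fill in):
# kubectl patch mutatingwebhookconfiguration <name> --type=json \
#   -p "$(failure_policy_patch 0 Ignore)"
```

With the policy set to Ignore, a request is admitted when the webhook cannot be reached, which is what lets the first install get past the "connection refused" errors; the trade-off is that objects created in that window skip the operator's mutation.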

jaronoff97 commented 1 year ago

Looking at what kube-prometheus-stack does right now.

jaronoff97 commented 1 year ago

It looks like it's configurable (obv). Its default behavior is empty and enabled, which means the policy is going to be set to Ignore, so I think that seems reasonable for us to do.

They also recommend pre-installing cert-manager on a cluster to use these webhooks.

TylerHelmuth commented 1 year ago

Seeing as the chart is trying to follow the same pattern for value I think it makes sense to follow the same technical patterns as well.

jaronoff97 commented 1 year ago

Agreed. I can work on it this week and next week to match those expectations. I'll include some docs about these decisions as well.

JaredTan95 commented 7 months ago

> I believe it is the solution that kube-prometheus-stack uses.

Yes, Indeed

ferrucc-io commented 4 months ago

Is this something someone is still working on? Given how complex the whole ecosystem was for me to grasp when starting out, what would make the most sense from my perspective is to have some way to add presets to the OpenTelemetry Operator.

IMO if someone wants to plug in Otel to their cluster most likely they'll want to have the ability to get:

It would be ideal if the default setup of the operator easily allowed you to get a setup like the one Honeycomb suggests in their getting started guide.

jaronoff97 commented 4 months ago

@ferrucc-io yes I'm still working on this, I've had a whole slew of other priorities that keep taking precedence.