open-telemetry / opentelemetry-helm-charts

OpenTelemetry Helm Charts

invalid argument issue after 0.69.0 upgrade #1339

Open jihuiyang opened 2 months ago

jihuiyang commented 2 months ago
> kubectl -n otel-collector logs po/otel-collector-collector-zzs7q
Error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
2024/09/09 23:14:08 collector server run finished with error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled

OTel collectors are failing after the 0.69.0 upgrade.

When I describe the container I can see -confmap.unifyEnvVarExpansion; it looks like the new collector version 0.108.0 rejects it:

  Containers:
   otc-container:
    Image:       ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-k8s:0.108.0
    Ports:       8888/TCP, 4317/TCP, 4318/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost
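
For reference, the collector pod's args can also be pulled directly with jsonpath; a sketch, assuming the standard operator-managed pod label (the selector below is an assumption and may differ per install):

    # print the args of the first collector pod; adjust the label selector to your install
    kubectl -n otel-collector get pods \
      -l app.kubernetes.io/component=opentelemetry-collector \
      -o jsonpath='{.items[0].spec.containers[0].args}'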
dosubot[bot] commented 2 months ago

Hey @jihuiyang! I'm here to help you with any bugs, questions, or contributions. Let's tackle this together!

I found a similar open issue that might be relevant to your problem: [operator] Collector fails with featureGate errors when Upgrading the Operator to chart version 0.68.1. The failure you're seeing is caused by changes in the OpenTelemetryCollector CRD: the confmap.unifyEnvVarExpansion feature gate is now stable and can no longer be disabled.

To resolve this issue, you need to ensure that your CRDs are properly annotated and labeled before upgrading. Here are the steps:

  1. Annotate and label the CRDs:

    RELEASE_NAME=my-opentelemetry-operator
    RELEASE_NAMESPACE=opentelemetry-operator-system
    kubectl annotate crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io \
      meta.helm.sh/release-name=${RELEASE_NAME} \
      meta.helm.sh/release-namespace=${RELEASE_NAMESPACE}
    kubectl label crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io app.kubernetes.io/managed-by=Helm
  2. Follow the TLS certificate requirements and upgrade instructions provided in the OpenTelemetry Operator Helm Chart documentation.
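
Once the CRDs are adopted by Helm, the upgrade itself is a standard helm upgrade; a minimal sketch reusing the variables from step 1 (the chart repo alias open-telemetry and the target version are assumptions):

    helm upgrade ${RELEASE_NAME} open-telemetry/opentelemetry-operator \
      --namespace ${RELEASE_NAMESPACE} \
      --version 0.69.0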

Additionally, if you are upgrading from version 0.67 to 0.68, note that the preset.kubernetesEvents preset now excludes DELETED watch types. If you want to keep ingesting DELETED updates for Kubernetes Events, you will need to configure the k8sobjectsreceiver manually.
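
For reference, a minimal sketch of such a manual k8sobjects receiver configuration that keeps all watch event types, including DELETED; field names follow the k8sobjectsreceiver README, and the preset's exclusion works via exclude_watch_type, so omitting that field restores DELETED updates:

    receivers:
      k8sobjects:
        objects:
          # no exclude_watch_type entry, so ADDED, MODIFIED,
          # and DELETED watch events are all ingested
          - name: events
            mode: watch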

This is a known bug and will be fixed by this PR in the operator: https://github.com/open-telemetry/opentelemetry-operator/pull/3074. The fix will be included in the next operator Helm release [1].

jihuiyang commented 2 months ago

The bug still exists in the latest operator version.

TylerHelmuth commented 2 months ago

@jihuiyang You cannot disable the confmap.unifyEnvVarExpansion feature gate anymore.

Passing --feature-gates=-confmap.unifyEnvVarExpansion is not allowed.
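
Stable gates are permanently on; only gates that have not yet graduated can still be turned off with a leading minus. A sketch of an argument that should still be accepted, assuming component.UseLocalHostAsDefaultHost had not yet been stabilized in collector 0.108.0:

    --feature-gates=-component.UseLocalHostAsDefaultHost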

jihuiyang commented 2 months ago

I did not specify --feature-gates=-confmap.unifyEnvVarExpansion; all I did was helm install 0.69.0, and somehow it ended up with this argument.

jihuiyang commented 2 months ago

Maybe I should uninstall the whole thing and reinstall?

TylerHelmuth commented 2 months ago

@jihuiyang I haven't been able to reproduce your issue yet with a clean install or with a helm upgrade. Can you provide more details?

jaronoff97 commented 2 months ago

@jihuiyang the latest version of the operator should resolve this by removing the gate from a collector (we previously had to add this flag in the operator's code to keep users' configs from breaking when the collector changed).

Can you link any logs you are seeing from the operator?

jihuiyang commented 2 months ago

I tried a clean install and it worked without the --feature-gates=-confmap.unifyEnvVarExpansion flag. Let me try just the upgrade.

jaronoff97 commented 2 months ago

@jihuiyang thanks for trying that out. I have been debugging the upgrade process in this issue; if you run into a similar issue during the upgrade, I would really appreciate it if you could share the steps to reproduce. I've tried a few different ways of doing this and have yet to cause it to happen.

jihuiyang commented 2 months ago

Still running into the issue with the upgrade, going from 0.65.1 to 0.69.0:

> helm --namespace otel-operator-system ls
NAME                    NAMESPACE               REVISION    UPDATED                                 STATUS      CHART                           APP VERSION
opentelemetry-operator  otel-operator-system    9           2024-09-10 12:16:27.747165 -0700 PDT    deployed    opentelemetry-operator-0.69.0   0.108.0

The collector still sees the feature gate:

> kubectl -n otel-collector describe ds/otel-collector-collector | grep feature
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost

Operator log

> kubectl -n otel-operator-system logs po/opentelemetry-operator-595855cd5c-jx9hj -f
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","message":"Starting the OpenTelemetry Operator","opentelemetry-operator":"0.108.0","opentelemetry-collector":"ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-k8s:0.108.0","opentelemetry-targetallocator":"ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.108.0","operator-opamp-bridge":"ghcr.io/open-telemetry/opentelemetry-operator/operator-opamp-bridge:0.108.0","auto-instrumentation-java":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.33.5","auto-instrumentation-nodejs":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.52.1","auto-instrumentation-python":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.48b0","auto-instrumentation-dotnet":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.2.0","auto-instrumentation-go":"ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.14.0-alpha","auto-instrumentation-apache-httpd":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4","auto-instrumentation-nginx":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4","feature-gates":"-operator.golang.flags,operator.observability.prometheus","build-date":"2024-09-05T17:19:14Z","go-version":"go1.22.6","go-arch":"amd64","go-os":"linux","labels-filter":[],"annotations-filter":[],"enable-multi-instrumentation":false,"enable-apache-httpd-instrumentation":true,"enable-dotnet-instrumentation":true,"enable-go-instrumentation":false,"enable-python-instrumentation":true,"enable-nginx-instrumentation":false,"enable-nodejs-instrumentation":true,"enable-java-instrumentation":true,"create-openshift-dashboard":false,"zap-message-key":"message","zap-level-key":"level","zap-time-key":"timestamp","zap-level-format":"uppercase"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"setup","message":"the env var WATCH_NAMESPACE isn't set, watching all namespaces"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"setup","message":"Prometheus CRDs are installed, adding to scheme."}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"setup","message":"Openshift CRDs are not installed, skipping adding to scheme."}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector","path":"/mutate-opentelemetry-io-v1beta1-opentelemetrycollector"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1beta1-opentelemetrycollector"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector","path":"/validate-opentelemetry-io-v1beta1-opentelemetrycollector"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1beta1-opentelemetrycollector"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/convert"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.builder","message":"Conversion webhook enabled","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=Instrumentation","path":"/mutate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=Instrumentation","path":"/validate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-v1-pod"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpAMPBridge","path":"/mutate-opentelemetry-io-v1alpha1-opampbridge"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-opampbridge"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpAMPBridge","path":"/validate-opentelemetry-io-v1alpha1-opampbridge"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1alpha1-opampbridge"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"setup","message":"starting manager"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.metrics","message":"Starting metrics server"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","message":"starting server","name":"health probe","addr":"[::]:8081"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.metrics","message":"Serving metrics server","bindAddress":"0.0.0.0:8080","secure":false}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Starting webhook server"}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
I0910 19:17:10.847314       1 leaderelection.go:254] attempting to acquire leader lease otel-operator-system/9f7554c3.opentelemetry.io...
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.webhook","message":"Serving webhook server","host":"","port":9443}
{"level":"INFO","timestamp":"2024-09-10T19:17:10Z","logger":"controller-runtime.certwatcher","message":"Starting certificate watcher"}
I0910 19:18:05.154198       1 leaderelection.go:268] successfully acquired lease otel-operator-system/9f7554c3.opentelemetry.io
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","logger":"collector-upgrade","message":"looking for managed instances to upgrade"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","logger":"instrumentation-upgrade","message":"looking for managed Instrumentation instances to upgrade"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1alpha1.OpAMPBridge"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1.ConfigMap"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1.ServiceAccount"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1.Service"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1beta1.OpenTelemetryCollector"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1.Deployment"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting Controller","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ConfigMap"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceAccount"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Service"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Deployment"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.DaemonSet"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.StatefulSet"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Ingress"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v2.HorizontalPodAutoscaler"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.PodDisruptionBudget"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceMonitor"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.PodMonitor"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","message":"Starting Controller","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector"}
{"level":"INFO","timestamp":"2024-09-10T19:18:05Z","logger":"instrumentation-upgrade","message":"no instances to upgrade"}
{"level":"INFO","timestamp":"2024-09-10T19:18:06Z","message":"Starting workers","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","worker count":1}
{"level":"INFO","timestamp":"2024-09-10T19:18:06Z","message":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}
jaronoff97 commented 2 months ago

If you run this, do you see the string 'managed'? (The operator's upgrade routine only rewrites collector instances whose managementState is managed.)

k get otelcol -n otel-collector otel-collector -o yaml | grep 'managementState'
  managementState: managed
jihuiyang commented 2 months ago

Yes I do:

> kubectl -n otel-collector get otelcol otel-collector -o yaml | grep 'managementState'
  managementState: managed
fernandonogueira commented 1 week ago

I also experienced this. I had to delete and recreate the OpenTelemetryCollector resource (type = sidecar, in my case).
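
A sketch of that workaround, assuming the OpenTelemetryCollector spec is kept in a local collector.yaml and the resource is named otel-collector:

    # delete the stale CR so the operator regenerates the pod spec from scratch
    kubectl -n otel-collector delete otelcol otel-collector
    # recreate it from the saved manifest
    kubectl -n otel-collector apply -f collector.yaml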