jlcrow opened 3 months ago
Hey there, @jlcrow! 👋 I'm here to assist you with any bugs, questions, or contributions you have while waiting for a human maintainer.
The issue you're encountering, with the OpenTelemetry collector failing with featureGate errors and crashing in a loop after upgrading to chart version 0.68.1, is due to changes in the OpenTelemetryCollector CRD. Specifically, the confmap.unifyEnvVarExpansion feature gate is now stable and cannot be disabled.
To resolve this issue, you need to ensure that your CRDs are properly annotated and labeled before upgrading. Here are the steps:
Annotate and label the CRDs:
RELEASE_NAME=my-opentelemetry-operator
RELEASE_NAMESPACE=opentelemetry-operator-system
kubectl annotate crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io \
meta.helm.sh/release-name=${RELEASE_NAME} \
meta.helm.sh/release-namespace=${RELEASE_NAMESPACE}
kubectl label crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io app.kubernetes.io/managed-by=Helm
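Before upgrading, you can verify the annotations and labels landed on the CRDs with something along these lines (a quick sketch; the jsonpath output is unformatted):
kubectl get crd opentelemetrycollectors.opentelemetry.io \
  -o jsonpath='{.metadata.annotations}{"\n"}{.metadata.labels}{"\n"}'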
Follow the TLS certificate requirements and upgrade instructions provided in the OpenTelemetry Operator Helm Chart documentation.
Additionally, if you are upgrading from version 0.67 to 0.68, note that the preset.kubernetesEvents preset now excludes DELETED watch types. If you want to keep ingesting DELETED updates for Kubernetes Events, you will need to configure the k8sobjectsreceiver manually [1][2][3].
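For reference, a manual k8sobjectsreceiver configuration for Kubernetes Events looks roughly like this (a sketch, not copied from the chart's preset, so double-check the fields against the receiver's docs):
receivers:
  k8sobjects:
    objects:
      - name: events
        mode: watch   # watch mode streams updates, including DELETED, as they happen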
@jlcrow this is a known bug and will be fixed by this PR in the operator https://github.com/open-telemetry/opentelemetry-operator/pull/3074
Separately, I would recommend upgrading the collector's apiVersion to v1beta1 when you get a chance :)
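For reference, that change is just the apiVersion on the custom resource (a sketch; note that v1beta1 also expects spec.config as structured YAML rather than a single string):
apiVersion: opentelemetry.io/v1beta1   # previously opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector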
solved by https://github.com/open-telemetry/opentelemetry-operator/pull/3074
this will be fixed in the next operator helm release. Thank you for your patience :)
@jlcrow can you upgrade to latest and let me know if that fixes things?
@jaronoff97 Just did a helm repo update open-telemetry and tried upgrading to 0.69.0
open-telemetry/opentelemetry-operator 0.69.0 0.108.0 OpenTelemetry Operator Helm chart for Kubernetes
Still seeing errors when the collector comes up
otel-prometheus-collector-0 0/1 Error 1 (5s ago) 11s
otel-prometheus-targetallocator-7bb6d4d7b-bq8q7 1/1 Running 0 12s
➜ cluster-management git: klon monitoring-system otel-prometheus-collector-0
Error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
2024/09/10 18:14:02 collector server run finished with error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
hmm any logs from the operator?
@jaronoff97 Nothing on the operator but info logs for the manager container
{"level":"INFO","timestamp":"2024-09-10T18:37:56Z","message":"Starting workers","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","worker count":1}
{"level":"INFO","timestamp":"2024-09-10T18:37:56Z","message":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}
One note: I tried running your config, and you should know that the memory_ballast extension has been removed. Testing this locally now though!
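Concretely, that just means dropping memory_ballast from the extensions list and leaning on the memory_limiter processor you already have; a sketch of the relevant fragment, mirroring the config below:
extensions:
  health_check:
    endpoint: ${K8S_POD_IP}:13133
  # memory_ballast removed: it is no longer available in recent collector versions
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 90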
I saw this message from the otel operator:
{"level":"INFO","timestamp":"2024-09-10T18:41:10Z","logger":"collector-upgrade","message":"instance upgraded","name":"otel-prometheus","namespace":"default","version":"0.108.0"}
and this is working now:
⫸ k logs otel-prometheus-collector-0
2024-09-10T18:41:15.297Z warn internal@v0.108.1/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning. {"kind": "extension", "name": "health_check", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-09-10T18:41:15.302Z warn internal@v0.108.1/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning. {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
Note: the target allocator is failing to startup because it's missing permissions on its service account, but otherwise things worked fully as expected.
Before:
Containers:
otc-container:
Image: otel/opentelemetry-collector-k8s:0.104.0
Ports: 8888/TCP, 4317/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--config=/conf/collector.yaml
--feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost
After:
Containers:
otc-container:
Image: otel/opentelemetry-collector-k8s:0.108.0
Ports: 8888/TCP, 4317/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--config=/conf/collector.yaml
--feature-gates=-component.UseLocalHostAsDefaultHost
@jaronoff97 Should have provided my latest config:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
name: otel-prometheus
namespace: monitoring-system
spec:
mode: statefulset
podAnnotations:
sidecar.istio.io/inject: "false"
targetAllocator:
serviceAccount: opentelemetry-targetallocator-sa
enabled: true
prometheusCR:
enabled: true
observability:
metrics:
enableMetrics: true
resources:
requests:
memory: 300Mi
cpu: 300m
limits:
memory: 512Mi
cpu: 500m
priorityClassName: highest-priority
resources:
requests:
memory: 600Mi
cpu: 300m
limits:
memory: 1Gi
cpu: 500m
env:
- name: K8S_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: K8S_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
config:
processors:
batch: {}
memory_limiter:
check_interval: 5s
limit_percentage: 90
extensions:
health_check:
endpoint: ${K8S_POD_IP}:13133
receivers:
prometheus:
config:
scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 10s
static_configs:
- targets: [ "${K8S_POD_IP}:8888" ]
metric_relabel_configs:
- action: labeldrop
regex: (id|name)
- action: labelmap
regex: label_(.+)
- job_name: kubernetes-nodes-cadvisor
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
honor_timestamps: true
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: node
regex: (.*)
replacement: $$1
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- job_name: kube-state-metrics
kubernetes_sd_configs:
- role: endpoints
selectors:
- role: endpoints
label: "app.kubernetes.io/name=kube-state-metrics"
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $$1:$$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: exporter_namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: service_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: node
metric_relabel_configs:
- source_labels: [__name__]
regex: kube_pod_status_(reason|scheduled|ready)
action: drop
otlp:
protocols:
grpc:
endpoint: ${K8S_POD_IP}:4317
exporters:
prometheusremotewrite:
endpoint: https://mimir/api/v1/push
retry_on_failure:
enabled: true
initial_interval: 1s
max_interval: 10s
max_elapsed_time: 30s
service:
telemetry:
metrics:
address: "${K8S_POD_IP}:8888"
level: basic
logs:
level: "warn"
extensions:
- health_check
pipelines:
metrics:
receivers:
- prometheus
- otlp
processors:
- memory_limiter
- batch
exporters:
- prometheusremotewrite
Also note, I needed to get rid of the priority class name and the service account name, which weren't provided. But thanks for updating, giving it a try...
Yeah, I tested going from 0.65.0 -> 0.69.0, which was fully successful with this configuration:
@jaronoff97 idk man, the feature gates seem to be sticking around for me when the operator is deploying the collector. I'm running on GKE; I don't think that should matter though.
otc-container:
Container ID: containerd://724dfd2080e9b46afac3fde71cb9e56747d8c6d352cd7c82b9baf272ed40a301
Image: otel/opentelemetry-collector-contrib:0.106.1
Image ID: docker.io/otel/opentelemetry-collector-contrib@sha256:12a6cab81088666668e312f1e814698f14f205d879181ec5f770301ab17692c2
Ports: 8888/TCP, 4317/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--config=/conf/collector.yaml
--feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost
otc-container:
Container ID: containerd://1cf06d1b6368d070ceb3a9f9448351b1638140a459ee9dbb2b9dbf7e3b173610
Image: otel/opentelemetry-collector-contrib:0.108.0
Image ID: docker.io/otel/opentelemetry-collector-contrib@sha256:923eb1cfae32fe09676cfd74762b2b237349f2273888529594f6c6ffe1fb3d7e
Ports: 8888/TCP, 4317/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--config=/conf/collector.yaml
--feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost
What was the version before? I thought it was 0.65.1, but want to confirm. And did you install the operator helm chart with upgrades disabled or any other flags? If I can get a local repro, I can try to get a fix out ASAP; otherwise it would be helpful to enable debug logging on the operator.
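For debug logging, one option is passing the standard controller-runtime zap flag to the manager; assuming the chart exposes manager.extraArgs (check your chart's values), that would look roughly like:
# values.yaml for the opentelemetry-operator chart (structure assumed)
manager:
  extraArgs:
    - --zap-log-level=debug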
I was able to make it to 0.67.0, any version later breaks the same way
Yeah, I just did this exact process. One thing I notice is that your collector has the -confmap.unifyEnvVarExpansion featuregate on it whereas mine does not. If you delete and recreate the otelcol object, is it still present? Another option would be to upgrade to operator 0.69.0 and then delete and recreate the otelcol object, at which point it should be gone... If that doesn't work or isn't possible, let me know and we can sort out some other options. A clean install of the operator also resolved this for another user who reported a similar issue: https://github.com/open-telemetry/opentelemetry-helm-charts/issues/1339#issuecomment-2341821666
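For the delete/recreate path, that would be roughly (the file name here is a placeholder for wherever your CR manifest lives):
kubectl delete opentelemetrycollector otel-prometheus -n monitoring-system
kubectl apply -f otel-prometheus.yaml   # re-apply the OpenTelemetryCollector manifest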
@jaronoff97
Looks like after a full uninstall and reinstall the flag is no longer present and the collector comes up successfully.
Okay, that's good, but I'm not satisfied with it. I'm going to keep investigating here and try to get a repro... I'm thinking maybe going from an older version to one that adds the flag, back to the previous version, and then up to latest may cause it.
@jaronoff97 I spoke too soon; somewhere along the line the targetallocator stopped picking up my monitors and I lost almost all of my metrics. I just went back to the alpha spec and 0.67 to get things working again.
That's probably due to the permissions change I alluded to here. This was the error message I saw:
{"level":"error","ts":"2024-09-10T18:41:53Z","logger":"setup.prometheus-cr-watcher","msg":"Failed to create namespace informer in promOperator CRD watcher","error":"missing list/watch permissions on the 'namespaces' resource: missing \"list\" permission on resource \"namespaces\" (group: \"\") for all namespaces: missing \"watch\" permission on resource \"namespaces\" (group: \"\") for all namespaces","stacktrace":"github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator/watcher.NewPrometheusCRWatcher\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/watcher/promOperator.go:115\nmain.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/main.go:119\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.6/x64/src/runtime/proc.go:271"}
- apiGroups: [""]
  resources:
    - namespaces
  verbs: ["get", "list", "watch"]
this block should do the trick, but I'm on mobile rn so sorry if it's not exactly right 😅
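For context, that rule goes in the target allocator's ClusterRole; a minimal sketch with a placeholder name (bind it to the target allocator's service account as usual):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-targetallocator-namespaces   # placeholder name
rules:
  - apiGroups: [""]
    resources:
      - namespaces
    verbs: ["get", "list", "watch"]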
I'm still having weird issues with the targetallocator on one of my clusters - it consistently fails to pick up any servicemonitor or podmonitor CRDs. I tried a number of things, including a full uninstall and reinstall, working with version 0.69 of the chart and 0.108 of the collector. I checked the RBAC for the service account and the auth appears to be there.
kubectl auth can-i get podmonitors --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator
yes
kubectl auth can-i get servicemonitors --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator
yes
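Since the earlier target allocator error complained about namespaces rather than the monitor CRs, the equivalent checks for that resource would be:
kubectl auth can-i list namespaces --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator
kubectl auth can-i watch namespaces --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator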
In the end, on a whim, I reverted the API back to v1alpha1, and when I deployed the spec the targetallocator /scrape_configs endpoint started showing all the podmonitors and servicemonitors instead of only the default prometheus config that's in the chart. I'm actually not understanding at all why this isn't working correctly, as I have another operator on another GKE cluster with the same config that doesn't seem to have an issue with the beta API.
Performed a routine helm upgrade from chart version 0.65.1 to 0.68.1; after the upgrade the created OpenTelemetry collector will not start. No errors in the operator - the collector errors and crash-loops.
Collector config