open-telemetry / opentelemetry-helm-charts

OpenTelemetry Helm Charts
https://opentelemetry.io
Apache License 2.0

[operator] Collector fails with featureGate errors when Upgrading the Operator to chart version 0.68.1 #1320

Open jlcrow opened 3 months ago

jlcrow commented 3 months ago

Performed a routine helm upgrade from chart version 0.65.1 to 0.68.1. After the upgrade, the previously created OpenTelemetry collector will not start. There are no errors in the operator - the collector errors and crash-loops.

otel-prometheus-collector-0                        0/1     CrashLoopBackOff   7 (4m20s ago)   15m

Error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
2024/08/28 19:23:44 collector server run finished with error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled

Collector config

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
  namespace: monitoring
spec:
  mode: statefulset
  podAnnotations:
     sidecar.istio.io/inject: "false"
  targetAllocator:
    serviceAccount: opentelemetry-targetallocator-sa
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  priorityClassName: highest-priority
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  config: |
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90    
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      memory_ballast: {}
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: [ '0.0.0.0:8888' ]         
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
          - job_name: kubernetes-nodes-cadvisor
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_timestamps: true
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              action: replace
              target_label: node
              regex: (.*)
              replacement: $$1         
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
              - __meta_kubernetes_node_name
              target_label: __metrics_path__
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - role: endpoints
                label: "app.kubernetes.io/name=kube-state-metrics" 
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: exporter_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: kube_pod_status_(reason|scheduled|ready)
              action: drop
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      prometheusremotewrite:
        endpoint: https://<internal mimir endpoint>
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s
    service:
      telemetry:
          metrics:
            address: "0.0.0.0:8888"
            level: basic
          logs:
            level: "warn"  
      extensions:
      - health_check
      - memory_ballast
      pipelines:
        metrics:
          receivers:
          - prometheus
          - otlp
          processors:
          - memory_limiter
          - batch
          exporters:
          - prometheusremotewrite
dosubot[bot] commented 3 months ago

Hey there, @jlcrow! 👋 I'm here to assist you with any bugs, questions, or contributions you have while waiting for a human maintainer.

The issue you're encountering with the OpenTelemetry collector failing with featureGate errors and crashing in a loop after upgrading to chart version 0.68.1 is due to changes in the OpenTelemetryCollector CRD. Specifically, the confmap.unifyEnvVarExpansion feature gate is now stable and cannot be disabled.

To resolve this issue, you need to ensure that your CRDs are properly annotated and labeled before upgrading. Here are the steps:

  1. Annotate and label the CRDs:

    RELEASE_NAME=my-opentelemetry-operator
    RELEASE_NAMESPACE=opentelemetry-operator-system
    kubectl annotate crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io \
      meta.helm.sh/release-name=${RELEASE_NAME} \
      meta.helm.sh/release-namespace=${RELEASE_NAMESPACE}
    kubectl label crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io app.kubernetes.io/managed-by=Helm
  2. Follow the TLS certificate requirements and upgrade instructions provided in the OpenTelemetry Operator Helm Chart documentation.

Additionally, if you are upgrading from version 0.67 to 0.68, note that the preset.kubernetesEvents preset now excludes DELETED watch types. If you want to keep ingesting DELETED updates for Kubernetes Events, you will need to configure the k8sobjectsreceiver manually [1][2][3].
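
If you do want to keep the DELETED events, a manually configured k8sobjects receiver would look roughly like this (a sketch; field names assume the contrib k8sobjectsreceiver, so verify them against your collector version):

```
receivers:
  k8sobjects:
    objects:
      - name: events
        mode: watch
        group: events.k8s.io
        # the 0.68 preset adds an exclude_watch_type entry for DELETED;
        # leaving that field out keeps DELETED updates flowing
```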

To continue talking to Dosu, mention @dosu.

jaronoff97 commented 3 months ago

@jlcrow this is a known bug and will be fixed by this PR in the operator https://github.com/open-telemetry/opentelemetry-operator/pull/3074

jaronoff97 commented 3 months ago

separately, i would recommend upgrading the collector CR's apiVersion to v1beta1 when you get a chance :)
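
For reference, the main difference in v1beta1 is that `spec.config` becomes structured YAML instead of an embedded string; a minimal sketch (the component choices here are just illustrative):

```
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
  namespace: monitoring
spec:
  mode: statefulset
  config:   # structured YAML, no longer a quoted string
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [debug]
```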

jaronoff97 commented 2 months ago

solved by https://github.com/open-telemetry/opentelemetry-operator/pull/3074

this will be fixed in the next operator helm release. Thank you for your patience :)

jaronoff97 commented 2 months ago

@jlcrow can you upgrade to latest and let me know if that fixes things?
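
(For reference, the upgrade in question boils down to something like the following; the release name and namespace are placeholders for whatever was used at install time:)

```
helm repo update open-telemetry
helm upgrade opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system \
  --version 0.69.0
```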

jlcrow commented 2 months ago

@jaronoff97 Just did a helm repo update open-telemetry and tried upgrading to 0.69.0

open-telemetry/opentelemetry-operator   0.69.0          0.108.0     OpenTelemetry Operator Helm chart for Kubernetes

Still seeing errors when the collector comes up

otel-prometheus-collector-0                       0/1     Error       1 (5s ago)    11s
otel-prometheus-targetallocator-7bb6d4d7b-bq8q7   1/1     Running     0             12s
➜  cluster-management git: klon monitoring-system otel-prometheus-collector-0                                                                                                                               
Error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
2024/09/10 18:14:02 collector server run finished with error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
jaronoff97 commented 2 months ago

hmm any logs from the operator?

jlcrow commented 2 months ago

@jaronoff97 Nothing on the operator but info logs for the manager container

{"level":"INFO","timestamp":"2024-09-10T18:37:56Z","message":"Starting workers","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","worker count":1}
{"level":"INFO","timestamp":"2024-09-10T18:37:56Z","message":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}
jaronoff97 commented 2 months ago

one note, i tried running your config and you should know that the memory_ballast extension is removed. testing this locally now though!
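
Since memory_ballast is gone, the usual replacement is the memory_limiter processor (already in your config) plus a GOMEMLIMIT soft limit on the pod; a sketch, with an illustrative value:

```
spec:
  env:
    - name: GOMEMLIMIT
      value: "480MiB"  # illustrative: roughly 80% of the 600Mi memory request
```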

jaronoff97 commented 2 months ago

i saw this message from the otel operator:

{"level":"INFO","timestamp":"2024-09-10T18:41:10Z","logger":"collector-upgrade","message":"instance upgraded","name":"otel-prometheus","namespace":"default","version":"0.108.0"}

and this is working now:

⫸ k logs otel-prometheus-collector-0
2024-09-10T18:41:15.297Z    warn    internal@v0.108.1/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.    {"kind": "extension", "name": "health_check", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-09-10T18:41:15.302Z    warn    internal@v0.108.1/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.    {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}

Note: the target allocator is failing to start up because it's missing permissions on its service account, but otherwise things worked fully as expected.

jaronoff97 commented 2 months ago

before:

  Containers:
   otc-container:
    Image:       otel/opentelemetry-collector-k8s:0.104.0
    Ports:       8888/TCP, 4317/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost

After:

  Containers:
   otc-container:
    Image:       otel/opentelemetry-collector-k8s:0.108.0
    Ports:       8888/TCP, 4317/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-component.UseLocalHostAsDefaultHost
jlcrow commented 2 months ago

@jaronoff97 Should have provided my latest config:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
  namespace: monitoring-system
spec:
  mode: statefulset
  podAnnotations:
     sidecar.istio.io/inject: "false"
  targetAllocator:
    serviceAccount: opentelemetry-targetallocator-sa
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  priorityClassName: highest-priority
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: K8S_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP          
  config:
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90    
    extensions:
      health_check:
        endpoint: ${K8S_POD_IP}:13133
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: [ "${K8S_POD_IP}:8888" ]         
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
          - job_name: kubernetes-nodes-cadvisor
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_timestamps: true
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              action: replace
              target_label: node
              regex: (.*)
              replacement: $$1         
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
              - __meta_kubernetes_node_name
              target_label: __metrics_path__
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - role: endpoints
                label: "app.kubernetes.io/name=kube-state-metrics" 
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: exporter_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: kube_pod_status_(reason|scheduled|ready)
              action: drop
      otlp:
        protocols:
          grpc:
            endpoint: ${K8S_POD_IP}:4317
    exporters:
      prometheusremotewrite:
        endpoint: https://mimir/api/v1/push
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s
    service:
      telemetry:
          metrics:
            address: "${K8S_POD_IP}:8888"
            level: basic
          logs:
            level: "warn"  
      extensions:
      - health_check
      pipelines:
        metrics:
          receivers:
          - prometheus
          - otlp
          processors:
          - memory_limiter
          - batch
          exporters:
          - prometheusremotewrite
jaronoff97 commented 2 months ago

also note, i needed to get rid of the priority class name and the service account name which weren't provided. but thanks for updating, giving it a try...

jaronoff97 commented 2 months ago

yeah i tested going from 0.65.0 -> 0.69.0 which was fully successful with this configuration:

Config
``` apiVersion: opentelemetry.io/v1beta1 kind: OpenTelemetryCollector metadata: name: otel-prometheus spec: mode: statefulset podAnnotations: sidecar.istio.io/inject: "false" targetAllocator: enabled: true prometheusCR: enabled: true observability: metrics: enableMetrics: true resources: requests: memory: 300Mi cpu: 300m limits: memory: 512Mi cpu: 500m resources: requests: memory: 600Mi cpu: 300m limits: memory: 1Gi cpu: 500m env: - name: K8S_POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: K8S_POD_IP valueFrom: fieldRef: fieldPath: status.podIP config: processors: batch: {} memory_limiter: check_interval: 5s limit_percentage: 90 extensions: health_check: endpoint: ${K8S_POD_IP}:13133 receivers: prometheus: config: scrape_configs: - job_name: "otel-collector" scrape_interval: 10s static_configs: - targets: ["${K8S_POD_IP}:8888"] metric_relabel_configs: - action: labeldrop regex: (id|name) - action: labelmap regex: label_(.+) - job_name: kubernetes-nodes-cadvisor bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token honor_timestamps: true kubernetes_sd_configs: - role: node relabel_configs: - source_labels: [__meta_kubernetes_pod_node_name] action: replace target_label: node regex: (.*) replacement: $$1 - action: labelmap regex: __meta_kubernetes_node_label_(.+) - replacement: kubernetes.default.svc:443 target_label: __address__ - regex: (.+) replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor source_labels: - __meta_kubernetes_node_name target_label: __metrics_path__ scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt insecure_skip_verify: true - job_name: kube-state-metrics kubernetes_sd_configs: - role: endpoints selectors: - role: endpoints label: "app.kubernetes.io/name=kube-state-metrics" relabel_configs: - action: keep regex: true source_labels: - __meta_kubernetes_service_annotation_prometheus_io_scrape - action: replace regex: (https?) 
source_labels: - __meta_kubernetes_service_annotation_prometheus_io_scheme target_label: __scheme__ - action: replace regex: (.+) source_labels: - __meta_kubernetes_service_annotation_prometheus_io_path target_label: __metrics_path__ - action: replace regex: ([^:]+)(?::\d+)?;(\d+) replacement: $$1:$$2 source_labels: - __address__ - __meta_kubernetes_service_annotation_prometheus_io_port target_label: __address__ - action: labelmap regex: __meta_kubernetes_service_label_(.+) - action: replace source_labels: - __meta_kubernetes_namespace target_label: exporter_namespace - action: replace source_labels: - __meta_kubernetes_service_name target_label: service_name - action: replace source_labels: - __meta_kubernetes_pod_node_name target_label: node metric_relabel_configs: - source_labels: [__name__] regex: kube_pod_status_(reason|scheduled|ready) action: drop otlp: protocols: grpc: endpoint: ${K8S_POD_IP}:4317 exporters: debug: {} service: telemetry: metrics: address: "${K8S_POD_IP}:8888" level: basic logs: level: "warn" extensions: - health_check pipelines: metrics: receivers: - prometheus - otlp processors: - memory_limiter - batch exporters: - debug --- --- # Source: opentelemetry-kube-stack/templates/clusterrole.yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: example-collector rules: - apiGroups: [""] resources: - namespaces - nodes - nodes/proxy - nodes/metrics - nodes/stats - services - endpoints - pods - events - secrets verbs: ["get", "list", "watch"] - apiGroups: ["monitoring.coreos.com"] resources: - servicemonitors - podmonitors verbs: ["get", "list", "watch"] - apiGroups: - extensions resources: - ingresses verbs: ["get", "list", "watch"] - apiGroups: - apps resources: - daemonsets - deployments - replicasets - statefulsets verbs: ["get", "list", "watch"] - apiGroups: - networking.k8s.io resources: - ingresses verbs: ["get", "list", "watch"] - apiGroups: ["discovery.k8s.io"] resources: - endpointslices verbs: ["get", "list", "watch"] - nonResourceURLs: ["/metrics", "/metrics/cadvisor"] verbs: ["get"] - apiGroups: - "" resources: - events - namespaces - namespaces/status - nodes - nodes/spec - pods - pods/status - replicationcontrollers - replicationcontrollers/status - resourcequotas - services verbs: - get - list - watch - apiGroups: - apps resources: - daemonsets - deployments - replicasets - statefulsets verbs: - get - list - watch - apiGroups: - extensions resources: - daemonsets - deployments - replicasets verbs: - get - list - watch - apiGroups: - batch resources: - jobs - cronjobs verbs: - get - list - watch - apiGroups: - autoscaling resources: - horizontalpodautoscalers verbs: - get - list - watch - apiGroups: ["events.k8s.io"] resources: ["events"] verbs: ["watch", "list"] --- # Source: opentelemetry-kube-stack/templates/clusterrole.yaml apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: example-daemon roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: example-collector subjects: - kind: ServiceAccount # quirk of the Operator name: "otel-prometheus-collector" namespace: default - kind: ServiceAccount name: otel-prometheus-targetallocator namespace: default ```
jlcrow commented 2 months ago

@jaronoff97 idk man the feature gates seem to be sticking around for me when the operator is deploying the collector. I'm running on GKE, but I don't think that should matter.

  otc-container:
    Container ID:  containerd://724dfd2080e9b46afac3fde71cb9e56747d8c6d352cd7c82b9baf272ed40a301
    Image:         otel/opentelemetry-collector-contrib:0.106.1
    Image ID:      docker.io/otel/opentelemetry-collector-contrib@sha256:12a6cab81088666668e312f1e814698f14f205d879181ec5f770301ab17692c2
    Ports:         8888/TCP, 4317/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost
  otc-container:
    Container ID:  containerd://1cf06d1b6368d070ceb3a9f9448351b1638140a459ee9dbb2b9dbf7e3b173610
    Image:         otel/opentelemetry-collector-contrib:0.108.0
    Image ID:      docker.io/otel/opentelemetry-collector-contrib@sha256:923eb1cfae32fe09676cfd74762b2b237349f2273888529594f6c6ffe1fb3d7e
    Ports:         8888/TCP, 4317/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost
jaronoff97 commented 2 months ago

what was the version before? I thought it was 0.65.1, but want to confirm. And did you install the operator helm chart with upgrades disabled or any other flags? If i can get a local repro, I can try to get a fix out ASAP, otherwise it would be helpful to enable debug logging on the operator.
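
(A sketch of turning on operator debug logging through the chart, assuming it exposes manager.extraArgs for the controller-runtime zap flags - double-check against the chart's values.yaml:)

```
manager:
  extraArgs:
    - --zap-log-level=debug
```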

jlcrow commented 2 months ago

I was able to make it to 0.67.0; any later version breaks the same way

jaronoff97 commented 2 months ago

yeah i just did this exact process:

jaronoff97 commented 2 months ago

another user who reported a similar issue resolved it by doing a clean install of the operator: https://github.com/open-telemetry/opentelemetry-helm-charts/issues/1339#issuecomment-2341821666

jlcrow commented 2 months ago

@jaronoff97

Looks like after a full uninstall and reinstall the flag is no longer present and the collector comes up successfully
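
(For anyone following along, the clean reinstall amounts to roughly the following; Helm leaves the CRDs and the OpenTelemetryCollector resources in place on uninstall, so the collector CR survives:)

```
helm uninstall opentelemetry-operator -n opentelemetry-operator-system
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  -n opentelemetry-operator-system --version 0.69.0
```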

jaronoff97 commented 2 months ago

okay thats good, but im not satisfied with it. im going to keep investigating here and try to get a repro... im thinking maybe going from an older version to one that adds the flag, back to the previous version and then up to latest may cause it.

jlcrow commented 2 months ago

@jaronoff97 I spoke too soon: somewhere along the line the targetallocator stopped picking up my monitors and I lost almost all of my metrics. I just went back to the alpha spec and 0.67 to get things working again

jaronoff97 commented 2 months ago

that's probably due to the permissions change i alluded to here. This was the error message I saw:

{"level":"error","ts":"2024-09-10T18:41:53Z","logger":"setup.prometheus-cr-watcher","msg":"Failed to create namespace informer in promOperator CRD watcher","error":"missing list/watch permissions on the 'namespaces' resource: missing \"list\" permission on resource \"namespaces\" (group: \"\") for all namespaces: missing \"watch\" permission on resource \"namespaces\" (group: \"\") for all namespaces","stacktrace":"github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator/watcher.NewPrometheusCRWatcher\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/watcher/promOperator.go:115\nmain.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/main.go:119\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.6/x64/src/runtime/proc.go:271"}
jaronoff97 commented 2 months ago

- apiGroups: [""]
  resources:
  - namespaces
  verbs: ["get", "list", "watch"]

this block should do the trick, but I'm on mobile rn so sorry if it's not exactly right 😅
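
Spelled out as standalone objects, that rule plus a binding for the target allocator's service account would look roughly like this (a sketch using the service account name from the config earlier in this thread; adjust names to your release):

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-targetallocator-namespaces
rules:
- apiGroups: [""]
  resources:
  - namespaces
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-targetallocator-namespaces
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-targetallocator-namespaces
subjects:
- kind: ServiceAccount
  name: opentelemetry-targetallocator-sa
  namespace: monitoring-system
```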

jlcrow commented 2 months ago

> @jaronoff97 I spoke too soon: somewhere along the line the targetallocator stopped picking up my monitors and I lost almost all of my metrics. I just went back to the alpha spec and 0.67 to get things working again

I'm still having weird issues with the targetallocator on one of my clusters - it consistently fails to pick up any ServiceMonitor or PodMonitor CRs. I tried a number of things, including a full uninstall and reinstall, working with version 0.69 of the chart and 0.108 of the collector. I checked the RBAC for the service account and the auth appears to be there.

kubectl auth can-i get podmonitors --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator                                                             
yes

kubectl auth can-i get servicemonitors --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator                                                   
yes
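
(The namespaces permission flagged in the earlier watcher error can be checked the same way:)

```
kubectl auth can-i list namespaces --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator
kubectl auth can-i watch namespaces --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator
```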

In the end, on a whim, I reverted the API back to v1alpha1, and when I deployed the spec the targetallocator/scrape_configs started showing all the podmonitors and servicemonitors instead of only the default prometheus config that's in the chart. I'm actually not understanding at all why this isn't working correctly, as I have another operator on another GKE cluster with the same config that doesn't seem to have an issue with the beta API.