open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector

TargetAllocator : error during loading configuration #1811

Closed. nlamirault closed this issue 1 year ago

nlamirault commented 1 year ago

Hi, I've got these logs at startup:

{"level":"info","ts":"2023-06-05T20:17:18Z","msg":"Starting the Target Allocator"}
{"level":"info","ts":"2023-06-05T20:17:18Z","logger":"allocator","msg":"Unrecognized filter strategy; filtering disabled"}
{"level":"info","ts":"2023-06-05T20:17:18Z","logger":"allocator","msg":"Starting server..."}
{"level":"info","ts":"2023-06-05T20:17:18Z","msg":"Waiting for caches to sync for servicemonitors\n"}
{"level":"info","ts":"2023-06-05T20:17:20Z","msg":"Caches are synced for servicemonitors\n"}
{"level":"info","ts":"2023-06-05T20:17:20Z","msg":"Waiting for caches to sync for podmonitors\n"}
{"level":"info","ts":"2023-06-05T20:17:20Z","msg":"Caches are synced for podmonitors\n"}
{"level":"info","ts":"2023-06-05T20:17:20Z","logger":"allocator","msg":"Successfully started a collector pod watcher","component":"opentelemetry-targetallocator"}
{"level":"error","ts":"2023-06-05T20:17:21Z","logger":"setup","msg":"Unable to load configuration","error":"empty duration string","stacktrace":"main.main.func13\n\t/app/main.go:198\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38"}
{"level":"error","ts":"2023-06-05T20:17:21Z","logger":"setup","msg":"Unable to load configuration","error":"empty duration string","stacktrace":"main.main.func13\n\t/app/main.go:198\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38"}
{"level":"error","ts":"2023-06-05T20:17:22Z","logger":"setup","msg":"Unable to load configuration","error":"empty duration string","stacktrace":"main.main.func13\n\t/app/main.go:198\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38"}

with this configuration:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  labels:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: opentelemetry-collector
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: opentelemetry-collector
    app.kubernetes.io/part-of: opentelemetry-collector
    app.kubernetes.io/version: 1.0.0
    argocd.argoproj.io/instance: opentelemetry-collector
    helm.sh/chart: opentelemetry-collector-1.0.0
  name: metrics
  namespace: opentelemetry
spec:
  config: |
    exporters:
      logging:
        verbosity: normal
      prometheus:
        endpoint: 0.0.0.0:9090
        metric_expiration: 180m
        resource_to_telemetry_conversion:
          enabled: true
      prometheusremotewrite/mimir:
        endpoint: http://mimir-nginx.monitoring.svc.cluster.local:80/api/v1/push
    extensions:
      basicauth/grafanacloud:
        client_auth:
          password: ${GRAFANA_CLOUD_METRICS_APIKEY}
          username: ${GRAFANA_CLOUD_METRICS_ID}
      health_check: null
      memory_ballast:
        size_in_percentage: 20
      pprof:
        endpoint: :1888
      zpages:
        endpoint: :55679
    processors:
      batch:
        send_batch_max_size: 1500
        send_batch_size: 1500
        timeout: 15s
      k8sattributes:
        extract:
          metadata:
          - k8s.namespace.name
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.node.name
          - k8s.pod.start_time
          - k8s.deployment.name
          - k8s.replicaset.name
          - k8s.replicaset.uid
          - k8s.daemonset.name
          - k8s.daemonset.uid
          - k8s.job.name
          - k8s.job.uid
          - k8s.cronjob.name
          - k8s.statefulset.name
          - k8s.statefulset.uid
          - container.image.tag
          - container.image.name
        passthrough: false
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.name
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90
        spike_limit_percentage: 30
      resource:
        attributes:
        - action: insert
          key: collector.name
          value: ${KUBE_POD_NAME}
    receivers:
      hostmetrics:
        collection_interval: 60s
        scrapers:
          cpu: null
          disk: null
          filesystem: null
          load: null
          memory: null
          network: null
          processes: null
      prometheus:
        config:
          global:
            evaluation_interval: 60s
            scrape_interval: 60s
            scrape_timeout: 60s
        target_allocator:
          collector_id: ${POD_NAME}
          endpoint: http://metrics-targetallocator:80
          http_sd_config:
            refresh_interval: 60s
          interval: 30s
    service:
      extensions:
      - health_check
      - memory_ballast
      - pprof
      - zpages
      pipelines:
        metrics:
          exporters:
          - logging
          - prometheus
          processors:
          - batch
          - memory_limiter
          - k8sattributes
          receivers:
          - hostmetrics
          - prometheus
      telemetry:
        logs:
          encoding: json
          level: info
        metrics:
          address: 0.0.0.0:8888
          level: detailed
  env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: K8S_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  envFrom:
  - secretRef:
      name: opentelemetry-datadog-credentials
  - secretRef:
      name: opentelemetry-lightstep-credentials
  - secretRef:
      name: opentelemetry-grafanacloud-credentials
  image: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.78.0
  ingress:
    route: {}
  mode: statefulset
  ports:
  - name: metrics
    port: 8888
    protocol: TCP
    targetPort: 8888
  replicas: 1
  resources:
    limits:
      memory: 3Gi
    requests:
      cpu: "1"
      memory: 2Gi
  serviceAccount: opentelemetry-collector-metrics
  targetAllocator:
    allocationStrategy: consistent-hashing
    enabled: true
    filterStrategy: relabel-config
    image: ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.78.0
    prometheusCR:
      enabled: true
    replicas: 1
    serviceAccount: opentelemetry-collector-metrics-targetallocator
  upgradeStrategy: automatic
eplightning commented 1 year ago

Encountered the same issue today. It seems to be related to some CRD changes: https://github.com/prometheus-operator/prometheus-operator/issues/5197.

Maybe we need to populate this struct with some defaults?

https://github.com/open-telemetry/opentelemetry-operator/blob/a0558ad3e0993479ce913f567c0c5d6d47a04ba6/cmd/otel-allocator/watcher/promOperator.go#LL60C92-L60C92

jaronoff97 commented 1 year ago

From the linked issue, it seems updating the Prometheus ServiceMonitor CRDs in your cluster may be the resolution: https://github.com/prometheus-operator/prometheus-operator/issues/5197#issuecomment-1446150799

Please let me know if that works :)

nlamirault commented 1 year ago

I still have the error after applying these manifests:

- https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-46.6.0/charts/kube-prometheus-stack/crds/crd-alertmanagerconfigs.yaml
- https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-46.6.0/charts/kube-prometheus-stack/crds/crd-alertmanagers.yaml
- https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-46.6.0/charts/kube-prometheus-stack/crds/crd-podmonitors.yaml
- https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-46.6.0/charts/kube-prometheus-stack/crds/crd-probes.yaml
- https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-46.6.0/charts/kube-prometheus-stack/crds/crd-prometheuses.yaml
- https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-46.6.0/charts/kube-prometheus-stack/crds/crd-prometheusrules.yaml
- https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-46.6.0/charts/kube-prometheus-stack/crds/crd-servicemonitors.yaml
- https://raw.githubusercontent.com/prometheus-community/helm-charts/kube-prometheus-stack-46.6.0/charts/kube-prometheus-stack/crds/crd-thanosrulers.yaml

My Prometheus object has these values:

  evaluationInterval: 30s
  scrapeInterval: 30s
eplightning commented 1 year ago

Same here, using the most recent ServiceMonitor/PodMonitor CRDs from the prometheus-operator repository.

I'm temporarily running my own build with the following change applied and it seems to work. I'm not sure if this is a proper fix, however:

    // Pass a Prometheus object with a defaulted ScrapeInterval so the
    // generated configuration never contains an empty duration string.
    generator, err := prometheus.NewConfigGenerator(log.NewNopLogger(), &monitoringv1.Prometheus{
        Spec: monitoringv1.PrometheusSpec{
            CommonPrometheusFields: monitoringv1.CommonPrometheusFields{
                ScrapeInterval: "30s",
            },
        },
    }, true) // TODO replace Nop?
matej-g commented 1 year ago

Seeing the same after updating to the latest TA (0.78.0). I think it's coming from the scrape time validation logic, in which we now have empty durations, since these now come from the Prometheus object passed to the config generator and are empty.

I think this is right, @eplightning. I was about to open a PR; would you like to instead?
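
For what it's worth, the "empty duration string" message matches what Prometheus' common/model.ParseDuration returns for an empty input, which is consistent with an unset scrape interval being round-tripped through the generated config. A minimal sketch (assuming the github.com/prometheus/common module is available) that reproduces the message:

    package main

    import (
        "fmt"

        "github.com/prometheus/common/model"
    )

    func main() {
        // An empty duration is rejected outright; this is the same message
        // seen in the target allocator logs.
        if _, err := model.ParseDuration(""); err != nil {
            fmt.Println(err) // empty duration string
        }

        // Any explicit default, e.g. "30s", parses fine.
        d, err := model.ParseDuration("30s")
        fmt.Println(d, err) // 30s <nil>
    }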

eplightning commented 1 year ago

@matej-g Please go ahead, I'm busy with something else at the moment.

mcanevet commented 1 year ago

I'm wondering whether it's due to this line since v0.78.0: https://github.com/open-telemetry/opentelemetry-operator/blob/70f22dc199dd328a70f314e323f95910f8686d3d/cmd/otel-allocator/go.mod#L21; in v0.77.0 it used to be https://github.com/open-telemetry/opentelemetry-operator/blob/0b26cbbfe904281713165235032a2fa351ebfbba/cmd/otel-allocator/go.mod#L21. Prometheus v0.43.0 is very, very old. I tried to revert to v0.77.0, but it looks like it does not support Kubernetes 1.27.

jaronoff97 commented 1 year ago

@mcanevet Prometheus v0.43.0 was released in March 2023, whereas the last one we were on was from 2021.

mcanevet commented 1 year ago

OK, then I must be wrong.

jaronoff97 commented 1 year ago

No worries, Prometheus versioning is incredibly aggravating; it seems like they publish v1s and v2s and then retract them once a month!

matej-g commented 1 year ago

Sorry for the delay; in the end I could not exactly pinpoint where the failure occurred until a few more tests (it's actually in the unmarshalling step). Although the fix worked, I wanted to understand where the failure was occurring. See the PR: https://github.com/open-telemetry/opentelemetry-operator/pull/1822

Thanks 🙇

achetronic commented 1 year ago

Same happening here

Thank you @matej-g for the research! I would suggest merging this workaround until the real root cause is found.

I left some suggestions on the code to avoid hardcoding params.

nlamirault commented 1 year ago

With the same configuration and the v0.80.0 release, I've got these errors:

{"level":"info","ts":"2023-07-04T15:30:51Z","logger":"opentelemetrycollector-resource","msg":"default","name":"traces"}
{"level":"info","ts":"2023-07-04T15:30:51Z","logger":"opentelemetrycollector-resource","msg":"validate update","name":"traces"}
{"level":"error","ts":"2023-07-04T15:32:28Z","logger":"controllers.OpenTelemetryCollector","msg":"failed to reconcile config maps","error":"failed to parse config: no scrape_configs available as part of the configuration","stacktrace":"github.com/open-telemetry/opentelemetry-operator/controllers.(*OpenTelemetryCollectorReconciler).RunTasks\n\t/workspace/controllers/opentelemetrycollector_controller.go:229\ngithub.com/open-telemetry/opentelemetry-operator/controllers.(*OpenTelemetryCollectorReconciler).Reconcile\n\t/workspace/controllers/opentelemetrycollector_controller.go:211\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"}
{"level":"error","ts":"2023-07-04T15:32:28Z","msg":"Reconciler error","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","OpenTelemetryCollector":{"name":"metrics","namespace":"opentelemetry"},"namespace":"opentelemetry","name":"metrics","reconcileID":"e53f6432-cbf5-49df-8022-bf1eafa36138","error":"failed to parse config: no scrape_configs available as part of the configuration","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"}
achetronic commented 1 year ago

Hi there, you have to set at least one static config under scrape_configs for the prometheus receiver. Commonly it is set to scrape the collector's own metrics. With that in place, it will work.

nlamirault commented 1 year ago

I use a ServiceMonitor to scrape the collectors' metrics, so I would rather not add scraping of the collector's own instance.

kubectl -n opentelemetry get servicemonitor
NAME                                              AGE
opentelemetry-operator                            25d
opentelemetry-collector-logs                      149m
opentelemetry-collector-metrics-targetallocator   149m
opentelemetry-collector-metrics                   149m
opentelemetry-collector-traces                    149m
achetronic commented 1 year ago

I understand, but the target allocator uses the Prometheus library with some tweaks on top of it. A workaround until this discussion is resolved is the following ;)

        # Ref: https://github.com/open-telemetry/opentelemetry-operator/issues/1811
        # Ref: https://github.com/open-telemetry/opentelemetry-operator/pull/1822/files
        # Ref: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14597
        prometheus:
          config:
            global:
              evaluation_interval: 60s
              scrape_interval: 60s
              scrape_timeout: 60s
            scrape_configs:
            - job_name: dummy
              static_configs:
              - targets:
                - 127.0.0.1:8888
          # Query the target allocator (or a compatible endpoint) for the list of scrape jobs
          target_allocator:
            collector_id: ${POD_NAME}
            endpoint: http://global-collector-targetallocator.open-telemetry-collector.svc:80
            interval: 30s
            http_sd_config:
              refresh_interval: 60s

The dummy scrape config makes it work. After that, it may be worth opening another issue to discuss this?

nlamirault commented 1 year ago

Thanks, I will try that...