open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Enabling WAL in the Prometheus remote write exporter stops metrics from being exported to the Mimir backend #33238

Open anushanagireddy0430 opened 2 months ago

anushanagireddy0430 commented 2 months ago

Component(s)

exporter/prometheusremotewrite

What happened?

To enable persistence, we configured a WAL directory in the Prometheus remote write exporter. Without the WAL, metrics reach Mimir and are visible in Grafana with Mimir as the data source. With the WAL enabled, the collector pod logs show the error under Log output below.
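
For reference, the relevant exporter settings, excerpted from the full CR below:

exporters:
  prometheusremotewrite/dev:
    endpoint: http://mimir-gateway.eaguann.svc:30015/api/v1/push
    wal:
      directory: ./prom_rw   # backed by the metric-data PVC (see volumeMounts in the CR)
      buffer_size: 500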

Steps to Reproduce

  1. Install the OpenTelemetry operator in a Kubernetes cluster (v1.27.6-gke.2500).
  2. Deploy the OpenTelemetryCollector CR and the required cluster roles and role bindings: otel-collector-cr.txt

Please help me understand how to fix this issue. Also, is there any way to exec into the collector pod to check the persisted log and trace data?
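
For the second question, is something like the following the right way to inspect the persisted data? (Pod name is a placeholder; paths are taken from the volume mounts in the CR below.)

kubectl -n obs11 get pods -l app.kubernetes.io/component=opentelemetry-collector
kubectl -n obs11 exec -it <collector-pod-name> -- sh   # assumes the image ships a shell; otherwise kubectl debug with an ephemeral container may be needed
ls -la /var/lib/otelcol/file_storage    # sending_queue persistence for logs and traces
ls -la ./prom_rw                        # WAL directory, relative to the collector working directory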

Collector version

otel/opentelemetry-collector-contrib:0.95.0

Environment information

Red Hat Enterprise Linux Kubernetes environment

OpenTelemetry Collector configuration

apiVersion: v1
items:
- apiVersion: opentelemetry.io/v1alpha1
  kind: OpenTelemetryCollector
  metadata:
    creationTimestamp: "2024-04-10T10:21:48Z"
    generation: 7
    labels:
      app.kubernetes.io/managed-by: opentelemetry-operator
    name: o11y-platform-otel-collector
    namespace: obs11
    resourceVersion: "317990395"
  spec:
    config: |
      connectors:
        spanmetrics:
          exemplars:
            enabled: true
      exporters:
        debug:
          verbosity: detailed
        prometheusremotewrite/dev:
          headers:
            X-Scope-OrgID: o11y-platform
          auth:
            authenticator: basicauth/prw
          endpoint: http://mimir-gateway.eaguann.svc:30015/api/v1/push
          tls:
            insecure: false
            insecure_skip_verify: true
          resource_to_telemetry_conversion:
            enabled: true
          send_metadata: true
          wal:
            directory: ./prom_rw
            buffer_size: 500
          retry_on_failure:
            enabled: true
          timeout: "20s"
          remote_write_queue:
            queue_size: 15000
            num_consumers: 10
        prometheusremotewrite/prod:
          headers:
            X-Scope-OrgID: ecs
          auth:
            authenticator: basicauth/prw
          endpoint: http://xxxxxxxx:80/api/v1/push
          tls:
            insecure: false
            insecure_skip_verify: true
          resource_to_telemetry_conversion:
            enabled: true
          send_metadata: true
        prometheusremotewrite/acceptance:
          headers:
            X-Scope-OrgID: ecs
          auth:
            authenticator: basicauth/prw
          endpoint: http://xxxxxxx:80/api/v1/push
          tls:
            insecure: false
            insecure_skip_verify: true
          resource_to_telemetry_conversion:
            enabled: true
          send_metadata: true
        loki/dev:
          headers:
            X-Scope-OrgID: o11y-platform
          tls:
            insecure: false
            insecure_skip_verify: true
          auth:
            authenticator: basicauth/lokiauth
          endpoint: https://ecs-dev1-loki-ingress.ericsson.com/loki/api/v1/push
          sending_queue:
            storage: file_storage
        loki/prod:
          headers:
            X-Scope-OrgID: ecs
          tls:
            insecure: false
            insecure_skip_verify: true
          auth:
            authenticator: basicauth/lokiauth
          endpoint: http://xxxxxxx:3100/loki/api/v1/push
        loki/acceptance:
          headers:
            X-Scope-OrgID: ecs
          tls:
            insecure: false
            insecure_skip_verify: true
          auth:
            authenticator: basicauth/lokiauth
          endpoint: http://xxxxxxx:3100/loki/api/v1/push
        otlp/dev:
          headers:
            X-Scope-OrgID: o11y-platform
          tls:
            insecure: true
          endpoint: tempo-distributor.ethonag.svc:4317
          sending_queue:
            storage: file_storage
        otlp/prod:
          headers:
            X-Scope-OrgID: ecs
          tls:
            insecure: true
          endpoint: x.x.x.x:4317
        otlp/acceptance:
          headers:
            X-Scope-OrgID: ecs
          tls:
            insecure: true
          endpoint: x.x.x.x:4317
      extensions:
        file_storage:
          directory: "/var/lib/otelcol/file_storage"
        basicauth/prw:
          client_auth:
            username: admin
            password: mimir
        basicauth/lokiauth:
          client_auth:
            username: loki
            password: loki
        basicauth/tempoauth:
          client_auth:
            username: tempo
            password: tempo
        zpages:
          endpoint: "localhost:55679"
        health_check: {}
      service:
        extensions:
        - basicauth/prw
        - basicauth/lokiauth
        - basicauth/tempoauth
        - zpages
        - health_check
        - file_storage
        telemetry:
          metrics:
            address: 127.0.0.1:8888
            level: detailed
          logs:
            level: DEBUG
        pipelines:
          metrics:
            receivers:
            - otlp
            - prometheus
            - k8s_cluster
            - kubeletstats
            - spanmetrics
            - hostmetrics
            exporters:
            - prometheusremotewrite/dev
            - prometheusremotewrite/prod
            - prometheusremotewrite/acceptance
            processors:
            - attributes/metrics
            - k8sattributes
            - resource/metrics
            - batch/metrics
          logs:
            receivers:
            - filelog
            - otlp
            exporters:
            - loki/dev
            - loki/prod
            - loki/acceptance
            processors:
            - resource/logs
          traces:
            receivers:
            - otlp
            exporters:
            - otlp/dev
            - otlp/prod
            - otlp/acceptance
            - spanmetrics
            processors:
            - k8sattributes
            - memory_limiter
            - resource/traces
            - batch/traces
      processors:
        resourcedetection/system:
          detectors: [env, system, gcp, eks]
          timeout: 2s
          override: false
        attributes/metrics:
          actions:
            - action: insert
              key: environment
              value: Development
            - action: insert
              from_attribute: k8s.pod.uid
              key: service.instance.id
            - action: insert
              key: cluster_name
              value: paas-usercluster2
            - action: insert
              key: cloud_provider
              value: GCP
            - action: insert
              key: X_Scope_OrgID
              value: ecs
            - action: insert
              key: cluster
              value: paas-usercluster2
        resource/metrics:
          attributes:
            - action: insert
              key: node
              value: ${env:K8S_NODE_NAME}
        resource/logs:
          attributes:
            - action: insert
              from_attribute: k8s.pod.uid
              key: service.instance.id
            - action: insert
              key: cluster_name
              value: paas-usercluster2
            - action: insert
              key: X_Scope_OrgID
              value: ecs
            - action: insert
              key: cloud_provider
              value: GCP
            - action: insert
              key: environment
              value: Development
            - action: insert
              key: node
              value: ${env:K8S_NODE_NAME}
            - action: insert
              key: loki.format
              value: raw
            - action: insert
              key: loki.resource.labels
              value: pod, namespace, container, filename, cluster_name, X_Scope_OrgID, cloud_provider, environment, node
        resource/traces:
          attributes:
            - action: insert
              from_attribute: k8s.pod.uid
              key: service.instance.id
            - action: insert
              key: cluster_name
              value: paas-usercluster2
            - action: insert
              key: X_Scope_OrgID
              value: ecs
            - action: insert
              key: cloud_provider
              value: GCP
            - action: insert
              key: environment
              value: Development
            - action: insert
              key: node
              value: ${env:K8S_NODE_NAME}
        batch/metrics:
          send_batch_size: 10000
          timeout: 200ms
        batch/traces:
          timeout: 10s
          send_batch_size: 1024
        memory_limiter:
          check_interval: 3s
          limit_mib: 8000
          spike_limit_mib: 2000
        k8sattributes:
          auth_type: "serviceAccount"
          passthrough: true
          filter:
            node_from_env_var: K8S_NODE_NAME
      receivers:
        kubeletstats:
          collection_interval: 10s
          auth_type: "serviceAccount"
          endpoint: ${env:K8S_NODE_NAME}:10250
          insecure_skip_verify: true
          metric_groups:
          - container
          - pod
          - volume
          - node
          extra_metadata_labels:
            - container.id
            - k8s.volume.type
        k8s_cluster:
          auth_type: serviceAccount
          collection_interval: 10s
          node_conditions_to_report: [Ready, MemoryPressure, DiskPressure, NetworkUnavailable]
          allocatable_types_to_report: [cpu, memory, storage, ephemeral-storage]
        k8s_events:
          auth_type: "serviceAccount"
        otlp:
          protocols:
            grpc:
              endpoint: ${env:MY_POD_IP}:4317
            http:
              endpoint: ${env:MY_POD_IP}:4318
        prometheus:
          config:
            scrape_configs:
              - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                job_name: integrations/kubernetes/cadvisor
                kubernetes_sd_configs:
                  - role: node
                relabel_configs:
                  - replacement: kubernetes.default.svc.cluster.local:443
                    target_label: __address__
                  - regex: (.+)
                    replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
                    source_labels:
                      - __meta_kubernetes_node_name
                    target_label: __metrics_path__
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: false
                  server_name: kubernetes
              - job_name: integrations/kubernetes/kube-state-metrics
                kubernetes_sd_configs:
                  - role: pod
                relabel_configs:
                  - action: keep
                    regex: kube-state-metrics
                    source_labels:
                      - __meta_kubernetes_pod_label_app_kubernetes_io_name
                    target_label: namespace
              - job_name: integrations/node_exporter
                kubernetes_sd_configs:
                  - role: pod
                relabel_configs:
                  - action: keep
                    regex: prometheus-node-exporter.*
                    source_labels:
                      - __meta_kubernetes_pod_label_app_kubernetes_io_name
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_pod_node_name
                    target_label: instance
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_namespace
                    target_label: namespace
              - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                job_name: integrations/kubernetes/kubelet
                kubernetes_sd_configs:
                  - role: node
                relabel_configs:
                  - replacement: kubernetes.default.svc.cluster.local:443
                    target_label: __address__
                  - regex: (.+)
                    replacement: /api/v1/nodes/$${1}/proxy/metrics
                    source_labels:
                      - __meta_kubernetes_node_name
                    target_label: __metrics_path__
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: false
                  server_name: kubernetes
              - job_name: "kubernetes-apiservers"
                kubernetes_sd_configs:
                  - role: endpoints
                    namespaces:
                      names:
                        - default
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: true
                bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                relabel_configs:
                - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                  action: keep
                  regex: kubernetes;https
                - action: replace
                  source_labels:
                  - __meta_kubernetes_namespace
                  target_label: Namespace
                - action: replace
                  source_labels:
                  - __meta_kubernetes_service_name
                  target_label: Service
              - job_name: 'eaguann/mimir'
                scrape_interval: 5s
                scheme: http
                tls_config:
                  insecure_skip_verify: true
                metrics_path: /metrics
                static_configs:
                  - targets: ['mimir-results-cache.eaguann:9150','mimir-metadata-cache.eaguann:9150','mimir-chunks-cache.eaguann:9150','mimir-index-cache.eaguann:9150','mimir-alertmanager.eaguann:8080','mimir-compactor.eaguann:8080','mimir-distributor.eaguann:8080','mimir-ingester.eaguann:8080','mimir-overrides-exporter.eaguann:8080','mimir-querier.eaguann:8080','mimir-query-frontend.eaguann:8080','mimir-query-scheduler.eaguann:8080','mimir-ruler.eaguann:8080','mimir-store-gateway.eaguann:8080']
              - job_name: "otel-collector"
                scrape_interval: 10s
                static_configs:
                  - targets: ["127.0.0.1:8888"]
              - job_name: 'ecs-monitoring/grafana'
                scrape_interval: 5s
                scheme: https
                tls_config:
                  insecure_skip_verify: true
                metrics_path: /metrics
                static_configs:
                  - targets: ['x.x.x.x:30003']
              - job_name: 'o11y-postgres/postgres'
                scrape_interval: 10s
                scheme: http
                tls_config:
                  insecure_skip_verify: true
                metrics_path: /metrics
                static_configs:
                  - targets: ['postgres-prometheus-postgres-exporter.o11y-postgres.svc:80']
              - job_name: 'ebadsus/loki'
                scrape_interval: 5s
                scheme: http
                tls_config:
                  insecure_skip_verify: true
                metrics_path: /metrics
                static_configs:
                  - targets: ['loki-loki-distributed-distributor.ebadsus:3100','loki-loki-distributed-ingester.ebadsus:3100','loki-loki-distributed-compactor.ebadsus:3100','loki-loki-distributed-index-gateway.ebadsus:3100','loki-loki-distributed-querier.ebadsus:3100','loki-loki-distributed-query-frontend.ebadsus:3100','loki-loki-distributed-query-scheduler.ebadsus:3100','loki-loki-distributed-ruler.ebadsus:3100','loki-loki-distributed-memcached-chunks.ebadsus:9150','loki-loki-distributed-memcached-frontend.ebadsus:9150','loki-loki-distributed-memcached-index-queries.ebadsus:9150']
              - job_name: 'ethonag/tempo'
                scrape_interval: 5s
                scheme: http
                tls_config:
                  insecure_skip_verify: true
                metrics_path: /metrics
                static_configs:
                  - targets: ['tempo-compactor.ethonag:3100','tempo-distributor.ethonag:3100','tempo-ingester.ethonag:3100','tempo-metrics-generator.ethonag:3100','tempo-querier.ethonag:3100', 'tempo-query-frontend.ethonag:3100']
        hostmetrics:
          root_path: /hostfs
          collection_interval: 10s
          scrapers:
            cpu:
              metrics:
                system.cpu.utilization:
                  enabled: true
            disk: null
            load:
            filesystem:
              exclude_fs_types:
                fs_types:
                - autofs
                - binfmt_misc
                - bpf
                - cgroup2
                - configfs
                - debugfs
                - devpts
                - devtmpfs
                - fusectl
                - hugetlbfs
                - iso9660
                - mqueue
                - nsfs
                - overlay
                - proc
                - procfs
                - pstore
                - rpc_pipefs
                - securityfs
                - selinuxfs
                - squashfs
                - sysfs
                - tracefs
                match_type: strict
              exclude_mount_points:
                match_type: regexp
                mount_points:
                - /dev/*
                - /proc/*
                - /sys/*
                - /run/k3s/containerd/*
                - /var/lib/docker/*
                - /var/lib/kubelet/*
                - /snap/*
              metrics:
                system.filesystem.utilization:
                  enabled: true
            memory:
              metrics:
                system.memory.utilization:
                  enabled: true
            network:
            paging:
        filelog:
          include:
           - /var/log/pods/*/*/*.log
          start_at: beginning
          include_file_path: true
          include_file_name: false
          operators:
          - type: router
            id: get-format
            routes:
            - output: parser-docker
              expr: 'body matches "^\\{"'
            - output: parser-crio
              expr: 'body matches "^[^ Z]+ "'
            - output: parser-containerd
              expr: 'body matches "^[^ Z]+Z"'
          - type: regex_parser
            id: parser-crio
            regex: '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout_type: gotime
              layout: '2006-01-02T15:04:05.999999999Z07:00'
          - type: regex_parser
            id: parser-containerd
            regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          - type: json_parser
            id: parser-docker
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          - type: move
            from: attributes.log
            to: body
          - type: regex_parser
            id: extract_metadata_from_filepath
            regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]{36})\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
            parse_from: attributes["log.file.path"]
            cache:
              size: 128
          - type: move
            from: attributes["log.file.path"]
            to: resource["filename"]
          - type: move
            from: attributes.container_name
            to: resource["container"]
          - type: move
            from: attributes.namespace
            to: resource["namespace"]
          - type: move
            from: attributes.pod_name
            to: resource["pod"]
    deploymentUpdateStrategy: {}
    env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    hostNetwork: true
    ingress:
      route: {}
    managementState: managed
    mode: daemonset
    observability:
      metrics: {}
    podDisruptionBudget:
      maxUnavailable: 1
    replicas: 1
    securityContext:
      runAsUser: 0
    resources: {}
    serviceAccount: mimir-collector-collector
    targetAllocator:
      allocationStrategy: consistent-hashing
      filterStrategy: relabel-config
      observability:
        metrics: {}
      prometheusCR:
        scrapeInterval: 30s
      resources: {}
    updateStrategy: {}
    upgradeStrategy: automatic
    volumeMounts:
    - mountPath: /var/log/pods
      name: varlog
      readOnly: true
    - mountPath: /var/lib/docker/containers
      name: varlibdockercontainers
      readOnly: true
    - mountPath: /hostfs
      mountPropagation: HostToContainer
      name: hostfs
      readOnly: true
    - mountPath: /var/lib/otelcol/file_storage
      name: data
    - mountPath: ./prom_rw
      name: metric-data
    volumes:
    - hostPath:
        path: /var/log/pods
      name: varlog
    - hostPath:
        path: /var/lib/docker/containers
      name: varlibdockercontainers
    - hostPath:
        path: /
      name: hostfs
    - name: data
      ephemeral:
        volumeClaimTemplate:
          spec:
            storageClassName: standard-rwo
            resources:
              limits:
                storage: 5Gi
              requests:
                storage: 3Gi
            accessModes:
            - ReadWriteOnce
    - name: metric-data
      ephemeral:
        volumeClaimTemplate:
          spec:
            storageClassName: standard-rwo
            resources:
              limits:
                storage: 5Gi
              requests:
                storage: 3Gi
            accessModes:
            - ReadWriteOnce
  status:
    image: otel/opentelemetry-collector-contrib:0.95.0
    scale:
      selector: app.kubernetes.io/component=opentelemetry-collector,app.kubernetes.io/instance=obs11.o11y-platform-otel-collector,app.kubernetes.io/managed-by=opentelemetry-operator,app.kubernetes.io/name=o11y-platform-otel-collector-collector,app.kubernetes.io/part-of=opentelemetry,app.kubernetes.io/version=latest
    version: 0.95.0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Log output

2024-05-27T15:07:21.595Z        error   prw.wal prometheusremotewriteexporter@v0.95.0/wal.go:170        error processing WAL entries    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite/dev", "error": "not found"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).run.func1
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter@v0.95.0/wal.go:170

Additional context

No response

github-actions[bot] commented 2 months ago

Pinging code owners:

anushanagireddy0430 commented 2 months ago

Any solution or workaround for persistence would be of help!
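
For context, the loki and otlp exporters in the same CR are already configured to persist their queues through the standard sending_queue backed by the file_storage extension, excerpted from the CR above:

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage
exporters:
  loki/dev:
    sending_queue:
      storage: file_storage

Only the prometheusremotewrite WAL path produces the "not found" error.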

github-actions[bot] commented 13 hours ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.