
[exporter/prometheusremotewrite] Enabling WAL causes no metrics to be forwarded #6348

Closed: ImDevinC closed this issue 2 years ago

ImDevinC commented 2 years ago

Describe the bug: When using the prometheusremotewrite exporter with the WAL enabled, no metrics are sent from the collector to the remote write destination.

Steps to reproduce: The config in the config section below reproduces this error. Disabling the WAL section causes all metrics to be sent properly.

What did you expect to see? Prometheus metrics should appear in the remote write destination.

What did you see instead? No metrics were sent to the remote write destination.

What version did you use? 0.62.1

What config did you use?

clusterRole:
  create: false
config:
  exporters:
    logging:
      loglevel: info
    prometheusremotewrite:
      endpoint: http://thanos-receive-distributor:19291/api/v1/receive
      external_labels:
        cluster: eks-cluster
        environment: jupiterone-dev
        otel_replica: ${replica}
        region: us-east-1
      remote_write_queue:
        enabled: true
        num_consumers: 1
        queue_size: 5000
      resource_to_telemetry_conversion:
        enabled: true
      retry_on_failure:
        enabled: false
        initial_interval: 5s
        max_elapsed_time: 10s
        max_interval: 10s
      target_info:
        enabled: false
      timeout: 15s
      tls:
        insecure: true
      wal:
        buffer_size: 100
        directory: /data/prometheus/wal
        truncate_frequency: 45s
  extensions:
    health_check: {}
    memory_ballast: {}
    pprof:
      endpoint: :1888
    zpages:
      endpoint: :55679
  processors:
    batch/metrics:
      send_batch_max_size: 500
      send_batch_size: 500
      timeout: 180s
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets:
                  - ${MY_POD_IP}:8888
  service:
    extensions:
      - health_check
      - pprof
      - zpages
    pipelines:
      metrics:
        exporters:
          - logging
          - prometheusremotewrite
        processors:
          - batch/metrics
        receivers:
          - otlp
    telemetry:
      metrics:
        address: 0.0.0.0:8888
extraEnvs:
  - name: replica
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
mode: statefulset
podDisruptionBudget:
  enabled: true
  minAvailable: ""
  maxUnavailable: 1
podMonitor:
  enabled: false
ports:
  healthcheck:
    containerPort: 13133
    enabled: true
    hostPort: 13133
    protocol: TCP
    servicePort: 13133
  jaeger-grpc:
    enabled: false
  jaeger-thrift:
    enabled: false
  metrics:
    containerPort: 8888
    enabled: true
    protocol: TCP
    servicePort: 8888
  otlp:
    enabled: false
  otlp-grpc:
    containerPort: 4317
    enabled: true
    hostPort: 4317
    protocol: TCP
    servicePort: 4317
  otlp-http:
    containerPort: 4318
    enabled: true
    hostPort: 4318
    protocol: TCP
    servicePort: 4318
  prometheus:
    containerPort: 8889
    enabled: true
    hostPort: 8889
    protocol: TCP
    servicePort: 8889
  zipkin:
    enabled: false
prometheusRule:
  defaultRules:
    enabled: true
  enabled: true
replicaCount: 2
resources:
  limits:
    cpu: 1000m
    memory: 6Gi
  requests:
    cpu: 1000m
    memory: 6Gi
service:
  annotations: {}
  type: ClusterIP
serviceAccount:
  annotations: {}
  create: true
  name: opentelemetry-collector-metrics
serviceMonitor:
  enabled: true
  metricsEndpoints:
    - interval: 5s
      port: metrics
    - interval: 5s
      port: prometheus
extraVolumeMounts:
  - mountPath: /data
    name: otel-metrics-wal
statefulset:
  volumeClaimTemplates:
    - metadata:
        name: otel-metrics-wal
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: gp2
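
The relevant toggle is just the wal block on the prometheusremotewrite exporter (values repeated from the full config above); removing this block restores metric delivery, and re-adding it reproduces the failure:

exporters:
  prometheusremotewrite:
    endpoint: http://thanos-receive-distributor:19291/api/v1/receive
    wal:
      buffer_size: 100
      directory: /data/prometheus/wal
      truncate_frequency: 45s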

Environment OS: AWS Bottlerocket, running the otel/opentelemetry-collector-contrib:0.36.3 Docker image.

Additional context: From debugging, this looks to be a deadlock between persistToWAL() and readPrompbFromWAL(), but I'm not 100% certain.
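
As a rough illustration only (this is not the exporter's actual code), the failure mode I suspect looks like a reader that keeps holding the lock the writer needs while it waits for new WAL entries, so neither side can make progress:

// Hypothetical sketch of the suspected writer/reader lock cycle.
// persist() and read() are stand-ins for persistToWAL() and
// readPrompbFromWAL(); the real exporter's internals may differ.
package main

import (
	"fmt"
	"sync"
	"time"
)

type wal struct {
	mu      sync.Mutex
	entries [][]byte
	newData chan struct{} // signalled by the writer after an append
}

// persist appends an entry; it blocks forever if the reader never
// releases the shared lock.
func (w *wal) persist(entry []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.entries = append(w.entries, entry)
	select {
	case w.newData <- struct{}{}:
	default:
	}
}

// read takes the lock and, finding no data, waits for a signal
// without releasing the lock, so the writer can never deliver one.
func (w *wal) read() [][]byte {
	w.mu.Lock()
	defer w.mu.Unlock()
	for len(w.entries) == 0 {
		<-w.newData
	}
	out := w.entries
	w.entries = nil
	return out
}

func main() {
	w := &wal{newData: make(chan struct{}, 1)}
	done := make(chan struct{})

	go func() { w.read(); close(done) }()       // reader wins the lock first
	time.Sleep(100 * time.Millisecond)
	go func() { w.persist([]byte("sample")) }() // writer now blocks on the lock

	select {
	case <-done:
		fmt.Println("read completed")
	case <-time.After(2 * time.Second):
		fmt.Println("stalled: reader holds the lock while waiting on the writer")
	}
}

If that is the shape of the problem, the export side would stall entirely once the reader takes the lock, which would explain why no metrics ever reach the remote write endpoint.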

ImDevinC commented 2 years ago

Wrong repo, moving to the -contrib repository