lightstep / otel-collector-charts

This is the repository for Lightstep's recommendations for running an OpenTelemetry Collector.

metrics-collector should emit logs when it fails to connect to the target allocator #51

Closed Toaddyan closed 11 months ago

Toaddyan commented 1 year ago
metricsCollector:
  targetallocator:
      limits:
        cpu: 1000m
        memory: 4000Mi
      requests:
        cpu: 500m
        memory: 2000Mi
  config:
    extensions:
      health_check:
        endpoint: "0.0.0.0:13133"
        path: "/"
        check_collector_pipeline:
          enabled: false
          interval: "5m"
          exporter_failure_threshold: 5
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
          http:
            endpoint: "0.0.0.0:4318"
    processors:
      metricstransform/k8sservicename:
        transforms:
          - include: kube_service_info
            match_type: strict
            action: update
            operations:
              - action: update_label
                label: service
                new_label: k8s.service.name
      resourcedetection/env:
        detectors:
          - env
        timeout: 2s
        override: false
      k8sattributes:
        passthrough: false
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.name
      batch:
        send_batch_size: 1000
        timeout: 1s
        send_batch_max_size: 1500
      resource:
        attributes:
          - key: lightstep.helm_chart
            value: kube-otel-stack
            action: insert
          - key: job
            from_attribute: service.name
            action: insert
    exporters:
      prometheusremotewrite:
        endpoint: https://mimir/api/v1/push
        headers:
          "X-Scope-OrgID": "ORG_ID"
        external_labels:
          cluster: cluster_name
    service:
      extensions:
        - health_check
      pipelines:
        metrics:
          receivers:
            - prometheus
          processors:
            - resource
            - resourcedetection/env
            - k8sattributes
            - metricstransform/k8sservicename
            - batch
          exporters:
            - prometheusremotewrite
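
Note that the metrics pipeline above lists a prometheus receiver that is not defined in the receivers section of this values snippet; in kube-otel-stack that receiver, including its target allocator settings, is rendered by the chart/operator rather than supplied here. As a rough sketch only (the service name, port, and interval below are illustrative assumptions, not the chart's actual output), the injected receiver generally has this shape:

receivers:
  prometheus:
    config:
      scrape_configs: []          # scrape configs are fetched from the target allocator at runtime
    target_allocator:
      endpoint: http://metrics-targetallocator:80   # hypothetical target allocator service name
      interval: 30s                                  # how often to fetch scrape targets
      collector_id: ${POD_NAME}                      # identifies this collector replica to the allocator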

Situation

kube-proxy went down (this was not known at the time of debugging). While the metrics collector was trying to connect to the target allocator, the health check extension would kill the pod before any log message came out of the metrics collector, because it could not reach the service.
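
The restarts come from the Kubernetes liveness probe wired to the health_check extension's endpoint (0.0.0.0:13133, path "/" in the config above). A minimal sketch of a relaxed probe, assuming a plain Kubernetes container spec (the exact knobs exposed by the chart/operator may differ), that would give the collector time to emit logs before being restarted:

livenessProbe:
  httpGet:
    path: /          # matches the health_check extension's path
    port: 13133      # matches the health_check extension's endpoint
  initialDelaySeconds: 30   # allow startup (and startup logging) before the first probe
  periodSeconds: 10
  failureThreshold: 6       # roughly a minute of consecutive failures before the kubelet restarts the container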

Problem

There was no log indicating any symptom of failure besides a "timeout".

Ask

Some kind of logging mechanism that isn't hidden by the health check extension would be nice.
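
Until such a mechanism exists, one possible workaround (a sketch, assuming the chart passes this config through unchanged) is to raise the collector's own log verbosity via service.telemetry, so that startup and connection problems are more likely to be flushed before the pod is restarted:

service:
  telemetry:
    logs:
      level: debug   # default is info; debug makes the collector log considerably more detail during startup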

Toaddyan commented 1 year ago

More context on the issue: https://cloud-native.slack.com/archives/C03HVLM8LAH/p1698257127836959

jaronoff97 commented 11 months ago

@Toaddyan should I close this or move it to the operator?

jaronoff97 commented 11 months ago

IIRC the issue was that the liveness probe failing prevented logs from showing or something?

Toaddyan commented 11 months ago

Yea, that's correct. We can move it to the operator if you think it's appropriate. This isn't blocking me anymore, so I'm ok with closing too.

jaronoff97 commented 11 months ago

Yeah, we can just close it for now :) Thanks.