open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

loadbalancing: Collector fails to start if k8s_resolver encounters issues with watch/list endpoints #33804

Open khyatigandhi0612 opened 3 weeks ago

khyatigandhi0612 commented 3 weeks ago

Component(s)

exporter/loadbalancing

What happened?

Description

The loadbalancing exporter in the OpenTelemetry Collector Contrib package fails to start the collector when the k8s resolver cannot watch/list endpoints, and it logs the same errors continuously. This can occur in several scenarios, for example:

- Missing Role/RoleBinding: the collector pod does not have the role or role binding required to access the Kubernetes API resources.
- Incorrect service name: the k8s resolver configuration within the loadbalancing exporter specifies an invalid service name.

In both cases, the k8s resolver fails to retrieve the target endpoints for trace export, causing the collector startup to fail. A sketch of the RBAC objects missing in the first scenario follows.
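For the missing-permissions scenario, here is a minimal sketch of the Role/RoleBinding the resolver needs, inferred from the forbidden errors in the log output below (service account my-opentelemetry-collector in namespace default, Endpoints of a service in namespace tailsampler); the object name otel-lb-endpoints-reader is hypothetical:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-lb-endpoints-reader  # hypothetical name
  namespace: tailsampler          # namespace of the target service
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get", "list", "watch"]  # verbs the resolver's reflector needs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: otel-lb-endpoints-reader  # hypothetical name
  namespace: tailsampler
subjects:
- kind: ServiceAccount
  name: my-opentelemetry-collector  # service account shown in the log output
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: otel-lb-endpoints-reader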

Steps to Reproduce

Deploy an OpenTelemetry collector with the loadbalancing exporter configured to use the k8s resolver, then trigger either failure mode:

Option 1: Missing permissions: do not assign any role or role binding to the collector pod's service account.
Option 2: Incorrect service name: configure the k8s resolver in the loadbalancing exporter with a non-existent service name (a minimal sketch follows).

Start the collector deployment.
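For Option 2, a minimal exporter configuration sketch; tailsampling-svc-typo.tailsampler is a hypothetical service name that does not exist in the cluster:

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        # hypothetical non-existent service, in <name>.<namespace> form
        service: tailsampling-svc-typo.tailsampler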

Expected Result

The OpenTelemetry collector should start successfully even if the k8s resolver initially fails to retrieve the target endpoints due to missing permissions or an incorrect service name. The collector should keep attempting to connect to the Kubernetes API in the background for exporting traces, while the other pipelines function as expected.

Actual Result

The collector fails to start and becomes unavailable for exporting the other telemetry data in its pipelines.

Collector version

v0.95.0

Environment information

Kubernetes cluster

OpenTelemetry Collector configuration

exporters:
  debug: {}
  loadbalancing:
    protocol:
      otlp:
        timeout: 10s
        endpoint: localhost
        tls:
          insecure: true
    resolver:
      k8s:
        service: tailsampling-svc.tailsampler
extensions:
  health_check:
    endpoint: ${env:MY_POD_IP}:13133
processors:
  batch: {}
  memory_limiter:
    check_interval: 10s
    limit_percentage: 80
    spike_limit_percentage: 25
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:14250
      thrift_compact:
        endpoint: ${env:MY_POD_IP}:6831
      thrift_http:
        endpoint: ${env:MY_POD_IP}:14268
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
  zipkin:
    endpoint: ${env:MY_POD_IP}:9411
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - debug
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
    traces:
      exporters:
      - debug
      - loadbalancing
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp

Log output

2024-06-28T09:37:04.914Z    info    service@v0.103.0/service.go:115 Setting up own telemetry...
2024-06-28T09:37:04.914Z    info    service@v0.103.0/telemetry.go:96    Serving metrics {"address": ":8888", "level": "Normal"}
2024-06-28T09:37:04.914Z    info    exporter@v0.103.0/exporter.go:280   Development component. May change in the future.    {"kind": "exporter", "data_type": "logs", "name": "debug"}
2024-06-28T09:37:04.914Z    info    exporter@v0.103.0/exporter.go:280   Development component. May change in the future.    {"kind": "exporter", "data_type": "traces", "name": "debug"}
2024-06-28T09:37:04.915Z    info    memorylimiter/memorylimiter.go:160  Using percentage memory limiter {"kind": "processor", "name": "memory_limiter", "pipeline": "traces", "total_memory_mib": 15976, "limit_percentage": 80, "spike_limit_percentage": 25}
2024-06-28T09:37:04.915Z    info    memorylimiter/memorylimiter.go:77   Memory limiter configured   {"kind": "processor", "name": "memory_limiter", "pipeline": "traces", "limit_mib": 12781, "spike_limit_mib": 3994, "check_interval": 10}
2024-06-28T09:37:04.915Z    warn    jaegerreceiver@v0.103.0/factory.go:49   jaeger receiver will deprecate Thrift-gen and replace it with Proto-gen to be compatbible to jaeger 1.42.0 and higher. See https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/18485 for more details.   {"kind": "receiver", "name": "jaeger", "data_type": "traces"}
2024-06-28T09:37:04.915Z    info    service@v0.103.0/service.go:182 Starting otelcol-k8s... {"Version": "0.103.1", "NumCPU": 10}
2024-06-28T09:37:04.915Z    info    extensions/extensions.go:34 Starting extensions...
2024-06-28T09:37:04.915Z    info    extensions/extensions.go:37 Extension is starting...    {"kind": "extension", "name": "health_check"}
2024-06-28T09:37:04.915Z    info    healthcheckextension@v0.103.0/healthcheckextension.go:32    Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Endpoint":"10.1.1.32:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-06-28T09:37:04.915Z    info    extensions/extensions.go:52 Extension started.  {"kind": "extension", "name": "health_check"}
2024-06-28T09:37:04.915Z    info    otlpreceiver@v0.103.0/otlp.go:102   Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "logs", "endpoint": "10.1.1.32:4317"}
2024-06-28T09:37:04.915Z    info    otlpreceiver@v0.103.0/otlp.go:152   Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "logs", "endpoint": "10.1.1.32:4318"}
W0628 09:37:04.918760       1 reflector.go:539] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints "tailsampling-svc" is forbidden: User "system:serviceaccount:default:my-opentelemetry-collector" cannot list resource "endpoints" in API group "" in the namespace "tailsampler"
E0628 09:37:04.918786       1 reflector.go:147] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints "tailsampling-svc" is forbidden: User "system:serviceaccount:default:my-opentelemetry-collector" cannot list resource "endpoints" in API group "" in the namespace "tailsampler"
W0628 09:37:06.037354       1 reflector.go:539] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints "tailsampling-svc" is forbidden: User "system:serviceaccount:default:my-opentelemetry-collector" cannot list resource "endpoints" in API group "" in the namespace "tailsampler"
E0628 09:37:06.037425       1 reflector.go:147] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints "tailsampling-svc" is forbidden: User "system:serviceaccount:default:my-opentelemetry-collector" cannot list resource "endpoints" in API group "" in the namespace "tailsampler"
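The repeated forbidden errors above reflect the RBAC gap: the resolver parses service: tailsampling-svc.tailsampler as <name>.<namespace> and tries to list/watch the Endpoints of tailsampling-svc in the tailsampler namespace as system:serviceaccount:default:my-opentelemetry-collector. Assuming cluster access, the missing permission can be checked independently of the collector with kubectl auth can-i list endpoints --as=system:serviceaccount:default:my-opentelemetry-collector -n tailsampler, which prints "no" until a suitable Role/RoleBinding (such as the sketch above) is applied.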

Additional context

No response

github-actions[bot] commented 3 weeks ago

Pinging code owners: