open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[exporter/jaeger] unable to send traces to jaeger collector #10360

Closed: mrjavaguy closed this issue 2 years ago

mrjavaguy commented 2 years ago

I am guessing this is some sort of networking issue, but I have exhausted all my ideas for fixing it, and I am hoping someone here can help.

I am running the OpenTelemetry collector in an EKS 1.19 cluster with the following config (secrets removed):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: splunk
spec:
  image: otel/opentelemetry-collector-contrib:0.52.0
  ports: 
    - name: prometheusexport
      port: 8080
    - name: zpages
      port: 55679
  config: |
    receivers:
        otlp:
            protocols:
                grpc:
                http:
    exporters:
        splunk_hec/logs:
            token: ""
            endpoint: ""
            source: "k8s:logs"
            sourcetype: "otlp"
            index: "main"
            max_connections: 20
            disable_compression: false
            timeout: 10s
            tls:
              insecure_skip_verify: true
        splunk_hec/metrics:
            token: ""
            endpoint: ""
            source: "k8s:metrics"
            sourcetype: "prometheus"
            index: "metrics"
            max_connections: 20
            disable_compression: false
            timeout: 10s
            tls:
              insecure_skip_verify: true
        jaeger:
          endpoint: "simplest-collector.observability.svc.cluster.local:14250"
          tls:
            insecure: true
        prometheus:
            endpoint: "0.0.0.0:8080"
            send_timestamps: true
            metric_expiration: 180m
            resource_to_telemetry_conversion:
                enabled: true                           
    processors:
        batch:
        memory_limiter:
          # 80% of maximum memory up to 2G
          limit_mib: 400
          # 25% of limit up to 2G
          spike_limit_mib: 100
          check_interval: 5s        

    extensions:
        health_check:
        pprof:
        zpages:
          endpoint: 0.0.0.0:55679

    service:
        extensions: [pprof, zpages, health_check]
        pipelines:
            logs:
                receivers: [otlp]
                exporters: [splunk_hec/logs]
                processors: [memory_limiter, batch]    
            traces:
                receivers: [otlp]
                exporters: [jaeger]
                processors: [memory_limiter, batch]
            metrics:
                receivers: [otlp]
                exporters: [prometheus]
                processors: [memory_limiter, batch]

And the setup I have for Jaeger is

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
  namespace: observability
spec:
  strategy: allInOne
  allInOne:
    image: jaegertracing/all-in-one:1.34
    options:
      log-level: debug
  ingress:
    enabled: false      

The issue is that I am not seeing any traces in Jaeger for my services (logs are showing up in Splunk). If I look at the logs for the OpenTelemetry Collector, I see multiple lines like:

YYYY-MM-DDTHH:MM:SS.mmmZ warn zapgrpc/zapgrpc.go:191 [core] [Server #3] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"POST / HTTP/1.1\\r\\nHost: s\"" {"grpc_log": true}
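A "bogus greeting" of POST / HTTP/1.1 typically means an HTTP/1.1 client is talking to the collector's OTLP gRPC port (default 4317) instead of the OTLP HTTP port (default 4318). One way to sketch a check, assuming the operator exposes the collector as a Service named splunk-collector (derived from metadata.name) and some pod in the cluster has curl:

```shell
# Hypothetical check: OTLP/HTTP requests belong on port 4318; an HTTP/1.1
# POST to the gRPC port 4317 produces exactly the "bogus greeting" warning
# seen in the collector logs. An empty resourceSpans payload is enough to
# confirm the port accepts HTTP.
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}' \
  http://splunk-collector:4318/v1/traces
```

If this returns 200 but the same request against port 4317 fails, the instrumented services are likely pointed at the wrong port.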

In searching around I did find an old issue about this, but it said the fix was to set tls: insecure: true on the jaeger exporter, which I already have. Any ideas on how to fix this?

dmitryax commented 2 years ago

Looks like Jaeger is not accessible from the collector. Can you please make sure that simplest-collector.observability.svc.cluster.local:14250 is the right endpoint and that it can accept gRPC connections from the collector pod?
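One way to sketch that check from inside the cluster, without relying on a shell in the collector pod, is a throwaway pod running grpcurl (assuming the fullstorydev/grpcurl image is pullable and the cluster permits ad-hoc pods):

```shell
# Hypothetical in-cluster check: list the gRPC services exposed by the
# Jaeger collector at the exact endpoint the exporter is configured with.
# The grpcurl image's entrypoint is grpcurl itself, so only its arguments
# are passed after "--".
kubectl run grpcurl-test --rm -it --restart=Never \
  --image=fullstorydev/grpcurl -- \
  -plaintext simplest-collector.observability.svc.cluster.local:14250 list
```

If this lists jaeger.api_v2.CollectorService, DNS and gRPC connectivity from inside the cluster are fine and the problem lies elsewhere.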

mrjavaguy commented 2 years ago

I am not sure how to check the connection from the OpenTelemetry pod to the Jaeger pod. I tried to exec into the OpenTelemetry pod, but it does not seem to have a shell.

I did port-forward:

kubectl -n observability port-forward svc/simplest-collector 14250

and then used grpcurl:

./grpcurl -plaintext -v localhost:14250 list

which did return

grpc.reflection.v1alpha.ServerReflection
jaeger.api_v2.CollectorService
jaeger.api_v2.SamplingManager

In the Jaeger pod logs, I am seeing

{"level":"info","ts":1653520155.064642,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc000636f30, {IDLE connection error: desc = \"transport: Error while dialing dial tcp :16685: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1653520155.0646813,"caller":"channelz/funcs.go:340","msg":"[core][Channel #8] Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"warn","ts":1653521597.4371188,"caller":"channelz/funcs.go:342","msg":"[core][Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams received bogus greeting from client: \\\"\\\\x16\\\\x03\\\\x01\\\\x01\\\\t\\\\x01\\\\x00\\\\x01\\\\x05\\\\x03\\\\x03Ai\\\\x937e\\\\xc4&C\\\\x83pT2#\\\"\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1653521632.432181,"caller":"grpclog/component.go:71","msg":"[transport]transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}

I checked for any NetworkPolicy but there are none.

The pods are running on the same node.

I also checked the AWS security group for EKS, it seemed fine.

jpkrohling commented 2 years ago

If the OpenTelemetry Collector logs are saying that it received bogus payloads from its client, the Jaeger exporter is probably not even being called, as the issue seems to be between your client and the OpenTelemetry Collector. You can check that by replacing the Jaeger exporter with the Logging exporter: if you get the spans printed out to the console, then it's certainly a problem with the Jaeger exporter. If you still get the same error, your problem is happening at an earlier phase.
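A minimal sketch of that swap, keeping everything else in the original manifest unchanged (the logging exporter and its loglevel option were available in collector 0.52.0; only the traces pipeline's exporter list changes):

```yaml
exporters:
  logging:
    loglevel: debug   # print received spans to the collector's stdout
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [logging]   # was: [jaeger]
```

With this in place, spans appearing in the collector's own logs would confirm the receiver side works and isolate the fault to the Jaeger exporter.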

mrjavaguy commented 2 years ago

It is working in a different cluster, so it must be an environment issue. Closing.