
otel-collector requires HTTP/2 TLS passthrough from Envoy / Contour: should it? #1916

Closed kevincantu closed 3 months ago

kevincantu commented 4 years ago

I've just gotten started setting up otel-collector for some Kubernetes clusters where we use Envoy (configured via Contour) for routing, and discovered a detail that gave me fits, so I think it's worth laying it all out here. I suspect it may be a gRPC server issue in the collector: some gnarly interaction with Envoy, perhaps?

Expected

What I hoped was that otel-collector could be set up much like this demo with YAGES (a gRPC echo server).

I set this up using a Contour HTTPProxy in TCP proxying mode, which relies on SNI to route traffic by domain name:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yages
  namespace: monitoring
  labels:
    app: yages
spec:
  selector:
    matchLabels:
      app: yages
  replicas: 1
  template:
    metadata:
      labels:
        app: yages
    spec:
      containers:
      - name: grpcsrv
        image: quay.io/mhausenblas/yages:0.1.0
        ports:
        - containerPort: 9000
          protocol: TCP
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
---
apiVersion: v1
kind: Service
metadata:
  name: yages
  namespace: monitoring
  labels:
    app: yages
spec:
  ports:
  - name: demo
    port: 55682
    protocol: TCP
    targetPort: 9000
  selector:
    app: yages
---
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: yages
  namespace: monitoring
  labels:
    app: yages
spec:
  virtualhost:
    fqdn: yages.staging.test
    tls:
      secretName: yages-wildcard
      #passthrough: true
  tcpproxy:
    services:
    - name: yages
      port: 55682
      # tls: HTTP/1 TLS
      # h2:  HTTP/2 TLS
      # h2c: HTTP/2 cleartext
      protocol: h2c

You can exercise that yages app (to send a ping and receive a pong) with the following grpcurl command:

grpcurl --insecure -v yages.staging.test:443 yages.Echo.Ping

I expected routing just like that to work for otel-collector:
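
That is, I expected the same shape as the yages proxy above: Envoy terminating TLS with the wildcard cert and forwarding HTTP/2 cleartext to the collector's OTLP gRPC port. Sketched out (this mirrors the "before" side of the diff further down, so the exact names are illustrative):

---
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  virtualhost:
    fqdn: otel.staging.test
    tls:
      secretName: otel-wildcard   # Envoy terminates TLS here...
  tcpproxy:
    services:
    - name: otel-collector
      port: 55680
      protocol: h2c               # ...and forwards HTTP/2 cleartext upstream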

Actual

But that didn't work.

Instead, when configuring Envoy (via Contour) like that, I saw TCP events in the Envoy access logs like so, but no success:

[2020-10-01T03:18:09.593Z] "- - -" 0 - 0 15 33 - "-" "-" "-" "-" "172.21.5.170:55680"

My sample app (sending traffic to otel-grpc.staging.test:443) only received StatusCode.UNAVAILABLE error responses! (I extended this part of the opentelemetry-exporter-otlp Python library to log those codes.)

Workaround

To make things work, I had to configure Envoy to pass HTTP/2 TLS traffic to the upstream.

Like so:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  namespace: monitoring
  labels:
    app: opentelemetry
    component: otel-collector-conf
data:
  otel-collector-config: |
    receivers:
      otlp:
        protocols:
          grpc:
            tls_settings:
              cert_file: /tls/cert.pem
              key_file: /tls/key.pem
          http:
    processors:
      batch:
      memory_limiter:
        # Same as --mem-ballast-size-mib CLI argument
        ballast_size_mib: 1024
        # 80% of maximum memory
        limit_mib: 1600
        # 25% of limit
        spike_limit_mib: 512
        check_interval: 5s
    extensions:
      health_check: {}
      zpages:
        endpoint: "0.0.0.0:55679"  # default was localhost only!
    exporters:
      logging:
        logLevel: debug
      honeycomb:
        api_key: "$HONEYCOMB_API_KEY"
        dataset: "apps"
        api_url: "https://api.honeycomb.io"
    service:
      extensions: [health_check, zpages]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, honeycomb]
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  ports:
  - name: zpages
    port: 55679
    # when proxied: http://localhost:8001/api/v1/namespaces/monitoring/services/http:otel-collector:55679/proxy/debug/tracez
  - name: otlp-grpc # Default endpoint for OpenTelemetry receiver.
    port: 55680
  - name: otlp-http
    port: 55681
  - name: jaeger-grpc # Default endpoint for Jaeger gRPC receiver.
    port: 14250
  - name: jaeger-thrift-http # Default endpoint for Jaeger HTTP receiver.
    port: 14268
  - name: zipkin # Default endpoint for Zipkin receiver.
    port: 9411
  - name: metrics # Default endpoint for querying metrics.
    port: 8888
  selector:
    component: otel-collector
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 2
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      containers:
      - command:
          - "/otelcontribcol"
          - "--log-level=DEBUG"
          - "--config=/conf/otel-collector-config.yaml"
          # Memory Ballast size should be max 1/3 to 1/2 of memory.
          - "--mem-ballast-size-mib=1024"
        #image: otel/opentelemetry-collector-dev:latest
        image: otel/opentelemetry-collector-contrib:0.11.0
        name: otel-collector
        envFrom:
        - secretRef:
            name: otel-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports:
        - containerPort: 55679 # Default endpoint for ZPages.
        - containerPort: 55680 # OTLP gRPC receiver.
        - containerPort: 55681 # OTLP HTTP/JSON receiver.
        - containerPort: 14250 # Default endpoint for Jaeger gRPC receiver.
        - containerPort: 14268 # Default endpoint for Jaeger HTTP receiver.
        - containerPort: 9411  # Default endpoint for Zipkin receiver.
        - containerPort: 8888  # Default endpoint for querying metrics.
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
        - name: otel-tls
          mountPath: /tls
        livenessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
        readinessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
      volumes:
        - name: otel-collector-config-vol
          configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
        - name: otel-tls
          secret:
            secretName: otel-wildcard
            items:
              - key: tls.crt
                path: cert.pem
              - key: tls.key
                path: key.pem
---
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: otel-collector
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "contour"
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  virtualhost:
    fqdn: otel.staging.test
    tls:
      #secretName: otel-wildcard
      passthrough: true
  tcpproxy:
    services:
    - name: otel-collector
      port: 55680
      # tls: HTTP/1 TLS
      # h2:  HTTP/2 TLS
      # h2c: HTTP/2 cleartext
      protocol: h2

That is, in addition to the TLS cert setup for otel-collector, this Contour HTTPProxy config change:

   virtualhost:
...
     tls:
-      secretName: otel-wildcard
+      passthrough: true
   tcpproxy:
     services:
     - name: otel-collector
       port: 55680
       # tls: HTTP/1 TLS
       # h2:  HTTP/2 TLS
       # h2c: HTTP/2 cleartext
-      protocol: h2c
+      protocol: h2

Bug?

Specifically, I found that OTLP (gRPC) traffic only reached the collector when Envoy passed the HTTP/2 TLS stream through to it untouched; having Envoy terminate TLS and forward HTTP/2 cleartext (h2c) to the collector did not work.

I think that means that there's something we could do here to make otel-collector's gRPC server play nicely with Envoy!

kevincantu commented 4 years ago

Thanks, by the way, to @pjanotti and @flands who helped me in the Gitter channel, and to this Contour ticket that pointed me at yages!

kevincantu commented 4 years ago

My spidey sense tells me this cmux issue may be related... 🤷‍♀️

andrewcheelightstep commented 3 years ago

Hi folks. Just a quick check to see if there is a timeline for this fix, since we are running into this as well.

carlosalberto commented 3 years ago

Hey @kevincantu

As I'm not a Contour expert, I tested against 'vanilla' Envoy and got it working:
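
The essential detail with plain Envoy is that the upstream cluster must be explicitly configured for HTTP/2; otherwise gRPC requests get forwarded as HTTP/1.1. A minimal sketch of that kind of standalone config (host names, ports, and the overall shape are illustrative, written against the current Envoy v3 API rather than being a copy of the config tested here):

static_resources:
  listeners:
  - name: otlp_grpc
    address:
      socket_address: { address: 0.0.0.0, port_value: 55680 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: otlp_ingress
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: otel
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: otel_collector }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: otel_collector
    type: STRICT_DNS
    # The key detail: without HTTP/2 on the upstream connection, Envoy
    # forwards gRPC requests to the collector as HTTP/1.1 and they fail.
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: otel_collector
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: otel-collector.monitoring.svc.cluster.local, port_value: 55680 }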

I'm wondering if there's something Contour-specific here, or if I'm missing something. Let me know ;)

kevincantu commented 3 years ago

Oh that's encouraging: perhaps something in Envoy 1.16 fixes this? (The version of Contour I last tested with was using an earlier Envoy.)

carlosalberto commented 3 years ago

Hey @kevincantu Any update on this? ;)

dy009 commented 3 years ago

Any update on this? How can I disable TLS?
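
(In the workaround config above, TLS on the receiver comes only from the tls_settings block under the gRPC protocol; omitting that block should leave the OTLP receiver serving plaintext. A minimal sketch:)

receivers:
  otlp:
    protocols:
      grpc: {}   # no TLS settings -> plaintext gRPC (h2c)
      http: {}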

kevincantu commented 3 years ago

I'm no longer actively working on the same system which used this, so I haven't spun up a cluster to try any of this out again lately.

What I'd try, though, is setting up something like my example above with a newer version of Contour (and its corresponding newer version of Envoy), and seeing whether the workaround I showed is still necessary!

Specifically:

amitgoyal02 commented 1 year ago

How do you enable mTLS for the receiver?
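
(As a pointer: the collector's server-side TLS settings also accept a client CA, which is what enables client-certificate verification. A minimal sketch, assuming the CA bundle is mounted next to the server cert and key; the field is spelled tls_settings in older releases and tls in newer ones:)

receivers:
  otlp:
    protocols:
      grpc:
        tls:                           # "tls_settings" in older collector releases
          cert_file: /tls/cert.pem
          key_file: /tls/key.pem
          client_ca_file: /tls/ca.pem  # client certs are verified against this CA (path is a placeholder)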

atoulme commented 3 months ago

Closing as inactive, please reopen if this is still being worked on.

jmichalek132 commented 1 month ago

I managed to run into this; it seems that despite following https://projectcontour.io/docs/main/guides/grpc/, the Envoy instance is sending HTTP/1 requests to the otel-collector instance.

jmichalek132 commented 1 month ago


I got it working when I switched from using the Ingress object to the Contour-specific HTTPProxy object. I'll try to figure out if there's a difference between the configurations they generate.
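
For reference, the HTTPProxy pattern from the Contour gRPC guide boils down to roughly the following (a sketch with placeholder names, namespace, host, and port; the key part is protocol: h2c on the upstream service so Envoy speaks HTTP/2 toward the collector):

---
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: otel-collector-grpc
  namespace: monitoring
spec:
  virtualhost:
    fqdn: otel.staging.test
    tls:
      secretName: otel-wildcard
  routes:
  - conditions:
    - prefix: /
    services:
    - name: otel-collector
      port: 4317        # current OTLP gRPC default; adjust to your Service port
      protocol: h2c     # tell Envoy to use HTTP/2 (cleartext) toward the collector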