open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Receiving SSL handshake error from Kubernetes API server and OpenTelemetry webhook #2956

Open hesamhamdarsi opened 6 months ago

hesamhamdarsi commented 6 months ago

Component(s)

No response

Describe the issue you're reporting

Description:


We are observing SSL handshake errors between the otel operator's webhook server (default port 9443) and internal IPs of the Kubernetes API server.

Steps to reproduce:

Deploying the operator Helm chart with only a few changes (for our use case), including:

admissionWebhooks:
  namespaceSelector:
    matchLabels:
      otel-injection: enabled
manager:
  podAnnotations:
    sidecar.istio.io/inject: "false"
  resources:
    limits:
      memory: 256Mi
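With this namespaceSelector in place, the chart's admission webhooks should only fire for namespaces carrying the otel-injection=enabled label. A minimal illustration (the namespace name my-app is just a placeholder):

# label a namespace so it matches the webhook namespaceSelector
kubectl label namespace my-app otel-injection=enabled

# list the namespaces that currently match
kubectl get namespaces -l otel-injection=enabled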

Expected Result:

The OpenTelemetry operator and collectors work fine, but we are receiving the following logs from the operator pod, showing a TLS handshake error happening from time to time between the API server and the otel operator webhook server. We couldn't see any issue on the ValidatingWebhook and MutatingWebhook, though; they both seem to be working fine.
10.40.76.248 is the internal service IP of the Kubernetes API server; 10.40.99.143 is the pod IP of the OpenTelemetry operator.

2024/05/14 09:10:01 http: TLS handshake error from 10.40.76.248:55276: read tcp 10.40.99.143:9443->10.40.76.248:55276: read: connection reset by peer
2024/05/14 09:15:00 http: TLS handshake error from 10.40.76.248:36546: read tcp 10.40.99.143:9443->10.40.76.248:36546: read: connection reset by peer
2024/05/14 09:15:00 http: TLS handshake error from 10.40.76.248:36562: read tcp 10.40.99.143:9443->10.40.76.248:36562: read: connection reset by peer
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39346: read tcp 10.40.99.143:9443->10.40.76.248:39346: read: connection reset by peer
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39360: EOF
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39370: read tcp 10.40.99.143:9443->10.40.76.248:39370: read: connection reset by peer
2024/05/14 09:25:00 http: TLS handshake error from 10.40.76.248:42412: EOF
2024/05/14 09:28:00 http: TLS handshake error from 10.40.76.248:50632: EOF
2024/05/14 09:35:00 http: TLS handshake error from 10.40.76.248:34974: read tcp 10.40.99.143:9443->10.40.76.248:34974: read: connection reset by peer
2024/05/14 09:40:00 http: TLS handshake error from 10.40.76.248:53388: read tcp 10.40.99.143:9443->10.40.76.248:53388: read: connection reset by peer
2024/05/14 09:45:00 http: TLS handshake error from 10.40.76.248:50526: read tcp 10.40.99.143:9443->10.40.76.248:50526: read: connection reset by peer
2024/05/14 09:45:00 http: TLS handshake error from 10.40.76.248:50534: read tcp 10.40.99.143:9443->10.40.76.248:50534: read: connection reset by peer
2024/05/14 09:48:00 http: TLS handshake error from 10.40.76.248:39272: EOF
2024/05/14 09:50:00 http: TLS handshake error from 10.40.76.248:33666: read tcp 10.40.99.143:9443->10.40.76.248:33666: read: connection reset by peer

Troubleshooting steps:

To make sure there is no rate limiting happening between the API server and the otel operator, we've checked the API server logs as well as the API Priority and Fairness configuration for handling requests, and we didn't observe any suspicious behaviour there:

kubectl get flowschemas                                                                                                                                                                                                
NAME                           PRIORITYLEVEL     MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE     MISSINGPL
exempt                         exempt            1                    <none>                2y87d   False
eks-exempt                     exempt            2                    <none>                262d    False
probes                         exempt            2                    <none>                2y87d   False
system-leader-election         leader-election   100                  ByUser                2y87d   False
endpoint-controller            workload-high     150                  ByUser                200d    False
workload-leader-election       leader-election   200                  ByUser                2y87d   False
system-node-high               node-high         400                  ByUser                455d    False
system-nodes                   system            500                  ByUser                2y87d   False
kube-controller-manager        workload-high     800                  ByNamespace           2y87d   False
kube-scheduler                 workload-high     800                  ByNamespace           2y87d   False
kube-system-service-accounts   workload-high     900                  ByNamespace           2y87d   False
eks-workload-high              workload-high     1000                 ByUser                172d    False
service-accounts               workload-low      9000                 ByUser                2y87d   False
global-default                 global-default    9900                 ByUser                2y87d   False
catch-all                      catch-all         10000                ByUser                2y87d   False

kubectl get prioritylevelconfiguration 
NAME              TYPE      NOMINALCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
catch-all         Limited   5                          <none>   <none>     <none>             2y87d
exempt            Exempt    <none>                     <none>   <none>     <none>             2y87d
global-default    Limited   20                         128      6          50                 2y87d
leader-election   Limited   10                         16       4          50                 2y87d
node-high         Limited   40                         64       6          50                 455d
system            Limited   30                         64       6          50                 2y87d
workload-high     Limited   40                         128      6          50                 2y87d
workload-low      Limited   100                        128      6          50                 2y87d

kubectl get --raw /metrics | grep 'apiserver_flowcontrol_request_concurrency_in_use.*workload-low' 
apiserver_flowcontrol_request_concurrency_in_use{flow_schema="service-accounts",priority_level="workload-low"} 0            # current

kubectl get --raw /metrics | grep 'apiserver_flowcontrol_current_inqueue_requests.*workload-low' 
apiserver_flowcontrol_current_inqueue_requests{flow_schema="service-accounts",priority_level="workload-low"} 0              # queue
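As an extra sanity check (a sketch, not a step from any official docs), the handshake can also be exercised by hand by port-forwarding to the operator's webhook port; the Service name below is assumed from the default Helm chart naming and may differ in your release:

# forward the webhook Service locally (name and port assumed; adjust to your release)
kubectl -n monitoring port-forward svc/otel-operator-opentelemetry-operator-webhook 9443:443 &

# attempt a TLS handshake roughly the way the API server would
openssl s_client -connect 127.0.0.1:9443 -showcerts </dev/null

A clean handshake here points away from the serving certificate itself and more towards the client side (the API server) closing connections early.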

The certificate generated for the otel operator was also checked, and it is valid:

kubectl get certificate -n monitoring                                                                                                      
NAME                                                READY   SECRET                                                                 AGE
otel-operator-opentelemetry-operator-serving-cert   True    otel-operator-opentelemetry-operator-controller-manager-service-cert   26d

kubectl get secret otel-operator-opentelemetry-operator-controller-manager-service-cert -n monitoring -o jsonpath="{.data['tls\.crt']}" | base64 --decode > cert.crt
openssl x509 -in cert.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            9d:c0:73:fe:ab:4f:b1:1f:a8:24:ee:73:49:23:59:91
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU=otel-operator-opentelemetry-operator
        Validity
            Not Before: Apr  2 15:25:31 2024 GMT
            Not After : Jul  1 15:25:31 2024 GMT
        Subject: OU=otel-operator-opentelemetry-operator
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                # removed to reduce the message size 
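One more check not shown above: whether the caBundle injected into the webhook configurations still corresponds to the issuer of this serving certificate. The configuration name below is a placeholder; take the real one from the first command:

# find the operator's webhook configurations
kubectl get mutatingwebhookconfiguration

# decode the injected caBundle and compare its subject to the serving cert's issuer (OU=otel-operator-opentelemetry-operator)
kubectl get mutatingwebhookconfiguration <operator-mutating-webhook-name> \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 --decode | openssl x509 -noout -subject -dates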

Test environment:

Kubernetes version: v1.27.13-eks-3af4770
Provider: EKS
Operator version: 0.96.0

jaronoff97 commented 6 months ago

This is certainly odd... have you opened the port in your cluster's firewall rules? We have this recommendation for GKE, but it's probably a similar issue for EKS...
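For reference, on a GKE private cluster that recommendation amounts to allowing the control plane to reach the nodes on TCP 9443, roughly like the rule below (network, node tag, and master CIDR are placeholders for your cluster's values):

gcloud compute firewall-rules create allow-apiserver-to-otel-webhook \
  --network <cluster-network> \
  --source-ranges <master-ipv4-cidr> \
  --target-tags <node-tag> \
  --allow tcp:9443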

hesamhamdarsi commented 6 months ago

There is no firewalling issue at any layer: none in the CNI, none at the API server. It's worth mentioning that the webhook server is already working. For example, if you try to configure an OpenTelemetryCollector with a wrong configuration, it either gets an error or gets mutated by the webhook server. That means the communication between the k8s API server and the OpenTelemetry webhook server is already happening, which is why these errors are a bit weird.

jaronoff97 commented 6 months ago

I see... I haven't seen this issue before. I wonder if your API server is actually unhealthy, given it's a bunch of connection resets.
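A quick way to check API server health from the client side (standard health endpoints on recent Kubernetes versions):

kubectl get --raw '/readyz?verbose'
kubectl get --raw '/livez?verbose'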

melquisedequecosta98 commented 6 months ago

Same problem here

parkedwards commented 6 months ago

same here

jaronoff97 commented 6 months ago

@melquisedequecosta98 @parkedwards can you share:

melquisedequecosta98 commented 6 months ago

Hello. I managed to resolve this issue by removing the entire operator and installing it again. Follow the steps:

##############################################################################

operator-sdk olm uninstall

kubectl get mutatingwebhookconfiguration -A

kubectl delete mutatingwebhookconfiguration minstrumentation.kb.io- mopampbridge.kb.io- mopentelemetrycollectorbeta.kb.io-wrrtn mpod.kb.io- (change these to the names from your EKS cluster)

kubectl get validatingwebhookconfiguration -A

kubectl delete validatingwebhookconfiguration vinstrumentationcreateupdate.kb.io- vinstrumentationdelete.kb.io- vopampbridgecreateupdate.kb.io- vopampbridgedelete.kb.io- vopentelemetrycollectorcreateupdatebeta.kb.io- vopentelemetrycollectordeletebeta.kb.io- (change these to the names from your EKS cluster)

operator-sdk olm install

##############################################################################

The problem is that the mutatingwebhookconfiguration and validatingwebhookconfiguration objects were causing some kind of conflict with TLS; they are not removed by "operator-sdk olm uninstall" and need to be removed by hand.
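Before deleting anything, a lighter-weight check is to see which Service each leftover webhook configuration still points at; stale entries from a previous install will reference an old namespace or Service (the custom-columns output here is just for inspection):

kubectl get mutatingwebhookconfiguration,validatingwebhookconfiguration \
  -o custom-columns='NAME:.metadata.name,SVC-NAMESPACE:.webhooks[*].clientConfig.service.namespace,SVC-NAME:.webhooks[*].clientConfig.service.name'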

hesamhamdarsi commented 5 months ago

The actions above were already done, but the issue still exists.

But I found a few more issues related to the same error in other operators and services (e.g. gatekeeper): https://github.com/open-policy-agent/gatekeeper/issues/2142

The issue seems to be related to the connection pool in newer Golang versions. I am not sure, but at least that's what I've got from tracking multiple issues: https://github.com/golang/go/issues/50984

sunilkumar-nfer commented 5 months ago

I am also facing the same issue in GKE. Can anyone help us here?

we are using these versions

Helm version: 3.14
Kubernetes version: 1.28
Go version: go1.21.9
kubectl: 0.26.11
Chart version: 0.62.0

mveitas commented 4 months ago

When I was seeing these errors, we were also seeing OOM errors in our deployment of the operator. Once we increased the resources (memory), the issue appears to have gone away.
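For anyone who wants to rule this out, checking for OOMKilled restarts and raising the manager's memory limit via Helm would look roughly like this (release name, namespace, label selector, and chart reference are assumptions based on the values shown earlier in this thread):

# look for OOMKilled in the operator pod's last terminated state
kubectl -n monitoring get pods -l app.kubernetes.io/name=opentelemetry-operator \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# raise the memory limit while keeping the rest of the existing values
helm upgrade otel-operator open-telemetry/opentelemetry-operator -n monitoring \
  --reuse-values --set manager.resources.limits.memory=512Mi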

csyyy106 commented 1 week ago

I'm also experiencing this problem in my self-built k8s cluster. opentelemetry-operator 0.110, k8s 1.29. My OpenTelemetryCollector deployment manifest:

# central-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-trace
  namespace: test
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      batch:
        send_batch_size: 10000
        timeout: 10s

    exporters:
      debug:  {}
      otlp:
        endpoint: "tempo-sample.monitoring.svc.cluster.local:4317"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [debug, otlp]
---
apiVersion: opentelemetry.io/v1alpha1   # v1alpha1, not v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-sidecar
  namespace: test
spec:
  mode: sidecar  # set mode: sidecar to specify the run mode
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      batch:  {}
    exporters:
      debug:  {}
      otlp:
        endpoint: "otel-collector-trace-collector.test.svc.cluster.local:4317"
    service:
      telemetry:
        logs:
          level: "debug"
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [debug, otlp]

---
# Configure auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: demo-instrumentation
  namespace: test
spec:
  exporter:
    endpoint: "otel-collector-sidecar-collector.test.svc.cluster.local:4317"
  propagators:
    - tracecontext
    - baggage
  sampler:   # sampler settings
      # type: traceidratio
      type: always_on # no filtering, keep all traces
      # argument: "0.1"  # 10% sampling rate
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest

csyyy106 commented 1 week ago

The opentelemetry-operator pod log. These are my error logs; I don't understand why this error is reported:

{"level":"INFO","timestamp":"2024-11-12T20:43:03.203093427Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} {"level":"INFO","timestamp":"2024-11-12T20:43:03.209484229Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} 2024/11/12 20:43:03 http: TLS handshake error from 192.168.219.64:56952: EOF {"level":"INFO","timestamp":"2024-11-12T20:43:03.380465403Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} {"level":"INFO","timestamp":"2024-11-12T20:43:03.471981416Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} {"level":"INFO","timestamp":"2024-11-12T20:43:05.300897375Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} {"level":"INFO","timestamp":"2024-11-12T20:43:05.354009727Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} 2024/11/12 21:00:17 http: TLS handshake error from 192.168.219.64:52697: EOF 2024/11/12 21:00:17 http: TLS handshake error from 192.168.219.64:12383: EOF 2024/11/12 21:00:17 http: TLS handshake error from 192.168.219.64:1836: EOF 2024/11/12 21:00:17 http: TLS handshake error from 192.168.219.64:7338: EOF {"level":"INFO","timestamp":"2024-11-12T21:00:17.331397392Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} 2024/11/12 21:22:35 http: TLS handshake error from 192.168.219.64:44725: EOF