Open hesamhamdarsi opened 6 months ago
This is certainly odd... have you opened the port in your cluster's firewall rules? We have this recommendation for GKE, but it's probably a similar issue for EKS...
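For reference, on a GKE private cluster that usually means adding an ingress firewall rule that lets the control-plane CIDR reach the webhook port on the nodes. A rough sketch (NETWORK, MASTER_CIDR and NODE_TAG are placeholders for your environment):

# Sketch only: allow the GKE control plane to reach the operator webhook port (9443).
gcloud compute firewall-rules create allow-apiserver-to-otel-webhook \
  --network=NETWORK \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:9443 \
  --source-ranges=MASTER_CIDR \
  --target-tags=NODE_TAG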
There is no firewalling issue at any layer: none in the CNI, none at the API server. It's worth mentioning that the webhook server is already working. For example, if you apply an OpenTelemetryCollector with a wrong configuration, it either gets rejected with an error or gets mutated by the webhook server. That means communication between the Kubernetes API server and the OpenTelemetry webhook server is already happening, which is why these errors are a bit odd.
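For context, a quick way to confirm the admission path end to end: assuming the webhook configurations declare sideEffects: None (the operator's generated ones do), a server-side dry-run still invokes them, so the response tells you whether the API server reached the webhook on 9443.

# Sketch: my-otelcol.yaml is any OpenTelemetryCollector manifest (placeholder name).
kubectl apply --dry-run=server -f my-otelcol.yaml -o yaml
# A webhook validation error or a defaulted/mutated object in the output means the
# call got through; a context-deadline or connection error (with failurePolicy: Fail)
# means it did not.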
I see... I haven't seen this issue before. I wonder if your API server is actually unhealthy, given it's a bunch of connection resets.
Same problem here
same here
@melquisedequecosta98 @parkedwards can you share:
Hello. I managed to resolve this issue by removing the entire operator and reinstalling it. Follow these steps:
##############################################################################
operator-sdk olm uninstall
kubectl get mutatingwebhookconfiguration -A
kubectl delete mutatingwebhookconfiguration minstrumentation.kb.io- mopampbridge.kb.io- mopentelemetrycollectorbeta.kb.io-wrrtn mpod.kb.io- # replace with the names from your cluster
kubectl get validatingwebhookconfiguration -A
kubectl delete validatingwebhookconfiguration vinstrumentationcreateupdate.kb.io- vinstrumentationdelete.kb.io- vopampbridgecreateupdate.kb.io- vopampbridgedelete.kb.io- vopentelemetrycollectorcreateupdatebeta.kb.io- vopentelemetrycollectordeletebeta.kb.io- # replace with the names from your cluster
operator-sdk olm install
##############################################################################
The problem is that the "mutatingwebhookconfiguration" and "validatingwebhookconfiguration" objects were causing some kind of TLS conflict; they are not removed by "operator-sdk olm uninstall" and have to be deleted by hand.
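That could explain the handshake errors: if a stale webhook configuration still carries the old caBundle while the freshly installed operator serves a new certificate, the API server keeps failing the TLS handshake against the webhook. A rough way to compare the two (object, service and namespace names are placeholders):

# Sketch: CA registered in the (possibly stale) webhook configuration...
kubectl get mutatingwebhookconfiguration <name> \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | \
  openssl x509 -noout -subject -enddate
# ...versus the certificate the operator webhook is actually serving (the webhook
# Service normally forwards 443 to container port 9443; run this from inside the cluster).
openssl s_client -connect <operator-webhook-service>.<namespace>.svc:443 \
  -servername <operator-webhook-service>.<namespace>.svc </dev/null 2>/dev/null | \
  openssl x509 -noout -issuer -subject -enddate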
These actions have already been done, but the issue still exists.
But I found a few more issues related to the same error in other operators and services (e.g. gatekeeper): https://github.com/open-policy-agent/gatekeeper/issues/2142
The issue seems to be related to the connection pool in newer Go versions. I'm not sure, but at least that's what I've gathered from tracking multiple issues: https://github.com/golang/go/issues/50984
I am also facing the same issue on GKE; can anyone help us here?
We are using these versions: Helm 3.14, Kubernetes 1.28, Go go1.21.9, kubectl 0.26.11, chart version 0.62.0.
When I was seeing these errors, we were also seeing OOM errors in our deployment of the operator. Once we increased the resources (memory), this issue appears to have gone away.
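For anyone hitting the same thing with the official Helm chart, the operator's memory can be raised through the chart values. A sketch, assuming the chart's manager.resources layout (release and namespace names are placeholders; confirm the values path with helm show values):

helm upgrade opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system \
  --reuse-values \
  --set manager.resources.requests.memory=256Mi \
  --set manager.resources.limits.memory=512Mi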
I'm also experiencing this problem in my self-built k8s cluster (opentelemetry-operator 0.110, k8s 1.29). My OpenTelemetryCollector deployment manifest:
# central-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-trace
  namespace: test
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      batch:
        send_batch_size: 10000
        timeout: 10s
    exporters:
      debug: {}
      otlp:
        endpoint: "tempo-sample.monitoring.svc.cluster.local:4317"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [debug, otlp]
---
apiVersion: opentelemetry.io/v1alpha1 # v1alpha1, not v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-sidecar
  namespace: test
spec:
  mode: sidecar # set mode: sidecar to specify the run mode
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}
    exporters:
      debug: {}
      otlp:
        endpoint: "otel-collector-trace-collector.test.svc.cluster.local:4317"
    service:
      telemetry:
        logs:
          level: "debug"
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [debug, otlp]
---
# Configure auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: demo-instrumentation
  namespace: test
spec:
  exporter:
    endpoint: "otel-collector-sidecar-collector.test.svc.cluster.local:4317"
  propagators:
    - tracecontext
    - baggage
  sampler: # sampler settings
    # type: traceidratio
    type: always_on # no filtering (sample everything)
    # argument: "0.1" # 10% sampling rate
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
opentelemetry-operator pod log. These are my error logs; I don't understand why this error is reported:
{"level":"INFO","timestamp":"2024-11-12T20:43:03.203093427Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} {"level":"INFO","timestamp":"2024-11-12T20:43:03.209484229Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} 2024/11/12 20:43:03 http: TLS handshake error from 192.168.219.64:56952: EOF {"level":"INFO","timestamp":"2024-11-12T20:43:03.380465403Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} {"level":"INFO","timestamp":"2024-11-12T20:43:03.471981416Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} {"level":"INFO","timestamp":"2024-11-12T20:43:05.300897375Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} {"level":"INFO","timestamp":"2024-11-12T20:43:05.354009727Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} 2024/11/12 21:00:17 http: TLS handshake error from 192.168.219.64:52697: EOF 2024/11/12 21:00:17 http: TLS handshake error from 192.168.219.64:12383: EOF 2024/11/12 21:00:17 http: TLS handshake error from 192.168.219.64:1836: EOF 2024/11/12 21:00:17 http: TLS handshake error from 192.168.219.64:7338: EOF {"level":"INFO","timestamp":"2024-11-12T21:00:17.331397392Z","logger":"controllers.OpenTelemetryCollector","message":"pdb field is unset in Spec, creating default"} 2024/11/12 21:22:35 http: TLS handshake error from 192.168.219.64:44725: EOF
Component(s)
No response
Describe the issue you're reporting
Description:
We are observing some TLS handshake errors between the otel operator's webhook server on default port 9443 and internal IPs of the Kubernetes API server.
Steps to reproduce:
Deploying the operator Helm chart with only a few changes (for our use case), including:
Expected Result:
The OpenTelemetry operator and collectors work fine, but we are receiving the following logs from the operator pod, showing a TLS handshake error happening from time to time between the API server and the otel operator webhook server. We couldn't see any issue with the ValidatingWebhook and MutatingWebhook, though; they both seem to be working fine. 10.40.76.248 is the internal service IP of the Kubernetes API server; 10.40.99.143 is the pod IP of the opentelemetry operator.
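For reference, these addresses can be cross-checked against the default kubernetes Service and its endpoints (standard object names; the operator namespace is a placeholder):

# ClusterIP of the in-cluster API endpoint and the real API server addresses behind it:
kubectl get service kubernetes -n default
kubectl get endpoints kubernetes -n default
# Pod IP of the operator for comparison:
kubectl get pods -n opentelemetry-operator-system -o wide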
Troubleshooting steps:
To make sure there is no rate limiting happening between the API server and the otel operator, we checked the API server logs as well as API Priority and Fairness for request handling, and we didn't observe any suspicious behaviour there:
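(For reference, this kind of check can be done against the API Priority and Fairness objects and the API server's flow-control metrics; the resource and metric names below are the standard ones:)

kubectl get flowschemas
kubectl get prioritylevelconfigurations
# Requests rejected by APF, if any:
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total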
The certificate generated for the otel operator was also checked and it is valid:
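(A generic way to run this kind of check, with placeholder secret/namespace names since they depend on how the webhook certificate is managed:)

kubectl get secret <webhook-cert-secret> -n <operator-namespace> \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -subject -ext subjectAltName -dates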
Test environment:
Kubernetes version: v1.27.13-eks-3af4770; Provider: EKS; Operator version: 0.96.0