Hello @zhulei-pacvue, did this error happen on startup, or after the collector had been running for some time? Have you seen this happen repeatedly when running the collector, or was this only one time? Can you share more about what kind of environment the collector was running in?
@crobert-1 Thank you! This error often occurs after the collector has been running for some time. After the service is restarted, it runs normally again.
We're experiencing the same issue:
error prometheusexporter@v0.99.0/log.go:23 error encoding and sending metric family: write tcp 100.78.45.37:8889->100.78.6.13:46202: write: broken pipe
{"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.99.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
go.opentelemetry.io/collector/config/confighttp@v0.99.0/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.50.0/handler.go:214
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.50.0/handler.go:72
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
go.opentelemetry.io/collector/config/confighttp@v0.99.0/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
net/http/server.go:3137
net/http.(*conn).serve
net/http/server.go:2039
Usually there are about 10 or 20 such log entries around the same time. Some of them are logged at the exact same instant, while others are a few milliseconds apart.
In the last week alone, this has happened about once or twice a day.
We're running this on OpenShift using Kubernetes v1.26.13+8f85140.
The OpenTelemetry Collector runs as a container in the same pod as a Quarkus service, which sends metrics and traces to the Collector.
Here is the Kubernetes Deployment resource:
kind: Deployment
apiVersion: apps/v1
metadata:
  annotations:
    # ...
  name: service
  namespace: my-namespace
  labels:
    # ...
spec:
  replicas: 2
  selector:
    matchLabels:
      # ...
  template:
    metadata:
      labels:
        # ...
      annotations:
        # ...
        prometheus.io/path: /metrics
        prometheus.io/port: '8889'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: otel-config
          configMap:
            name: otel-config
            defaultMode: 420
        - name: opa-conf
          configMap:
            name: opa
            defaultMode: 420
      containers:
        - resources:
            limits:
              cpu: 500m
              memory: 1536Mi
            requests:
              cpu: 10m
              memory: 1536Mi
          readinessProbe:
            httpGet:
              path: /q/health/ready
              port: container-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 7
            successThreshold: 1
            failureThreshold: 7
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '5'
          name: service
          livenessProbe:
            httpGet:
              path: /q/health/live
              port: container-port
              scheme: HTTP
            initialDelaySeconds: 60
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            # ...
          ports:
            - name: container-port
              containerPort: 8080
              protocol: TCP
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          image: my-service
        - resources:
            limits:
              cpu: 10m
              memory: 150Mi
            requests:
              cpu: 10m
              memory: 150Mi
          readinessProbe:
            httpGet:
              path: /
              port: otel-health
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    while curl -X GET http://localhost:8080/q/health/live ; do
                    sleep 1; done
          name: otel-collector
          livenessProbe:
            httpGet:
              path: /
              port: otel-health
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: TZ
              value: Europe/Copenhagen
          ports:
            - name: otel-health
              containerPort: 13133
              protocol: TCP
            - name: otlp-grpc
              containerPort: 4317
              protocol: TCP
            - name: otlp-http
              containerPort: 4318
              protocol: TCP
            - name: jaeger-grpc
              containerPort: 14260
              protocol: TCP
            - name: jaeger-http
              containerPort: 14268
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: otel-config
              mountPath: /etc/config/otel
          terminationMessagePolicy: File
          # We use Red Hat's ubi8-minimal as a base image, and install
          # otelcol_VERSION_linux_amd64.rpm from
          # https://github.com/open-telemetry/opentelemetry-collector-releases/releases onto it
          image: >-
            my-repo/opentelemetry-collector:0.99.0
          args:
            - otelcol
            - '--config=/etc/config/otel/otel-collector-config.yaml'
        - resources:
            limits:
              cpu: 50m
              memory: 150Mi
            requests:
              cpu: 10m
              memory: 150Mi
          readinessProbe:
            httpGet:
              path: /health?bundle=true
              port: opa-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 7
            successThreshold: 1
            failureThreshold: 7
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    while curl -X GET http://localhost:8080/q/health/live ; do
                    sleep 1; done
          name: opa
          livenessProbe:
            httpGet:
              path: /health
              port: opa-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: TZ
              value: Europe/Copenhagen
          ports:
            - name: opa-port
              containerPort: 8181
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: opa-conf
              readOnly: true
              mountPath: /opa-conf
          terminationMessagePolicy: File
          image: >-
            opa:0.63.0
          args:
            - run
            - '--ignore=.*'
            - '--server'
            - '--log-level'
            - error
            - '--config-file'
            - /opa-conf/opa-conf.yaml
            - '--watch'
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
The OpenTelemetry Collector config:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "opa"
          scrape_interval: 15s
          metrics_path: "/metrics"
          static_configs:
            - targets:
                - "0.0.0.0:8181"
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14260
      thrift_http:
        endpoint: 0.0.0.0:14268

processors:
  batch:
    timeout: 1s

exporters:
  prometheus:
    add_metric_suffixes: false
    endpoint: 0.0.0.0:8889
  otlp:
    endpoint: "ose-jaeger-collector-headless.ose-jaeger.svc.cluster.local:4317"
    tls:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
      insecure: false
      min_version: "1.2"

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [ health_check ]
  pipelines:
    metrics:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ prometheus ]
    traces:
      receivers: [ otlp, jaeger ]
      processors: [ batch ]
      exporters: [ otlp ]
  telemetry:
    metrics:
      address: 0.0.0.0:8888
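For context on the timing: the broken pipe means the exporter was still writing the /metrics response when the scraping side closed the TCP connection. Our cluster Prometheus discovers the pod via the prometheus.io/* annotations in the Deployment above; roughly, the scrape job looks like the sketch below (the job name, interval, and timeout are assumptions for illustration, not our actual server config):

scrape_configs:
  - job_name: 'kubernetes-pods'   # assumed job name
    scrape_interval: 30s          # assumed; our annotations only set path/port/scrape
    scrape_timeout: 10s           # Prometheus default; raising it is one thing to try
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods annotated prometheus.io/scrape: 'true', as in the Deployment above
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

If the scraper's scrape_timeout elapses (or the scraper restarts) before the collector has finished writing the metrics payload on :8889, the client closes the connection and the exporter logs exactly this error.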
CPU and memory usage of the OpenTelemetry Collector shows nothing abnormal when the error occurs.
Same issue since we upgraded from 0.88.0 to 0.100.0
Hi @crobert-1, is there any ongoing effort to fix/mitigate this issue? We have a deployment of otel-collector in k8s and we face the exact same issue, with the same stack trace from the prometheusexporter error:
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.87.0/log.go:23
We also tried upgrading to the latest release, 0.97.0, but that didn't change anything.
We have the otel-collector running on a dedicated node without hard limits, only requests:
resources:
  requests:
    cpu: 32
    memory: 128Gi
Memory and CPU usage are well below the requested amounts: memory is at 12%, CPU at 6%.
It seems to start happening when we increase the metrics ingest volume: we have clients sending metrics every minute, and the errors started appearing once we increased the number of clients.
Is there any fix/mitigation for this issue? via a param or config change?
Thanks
My apologies, I'm largely unfamiliar with this component and its functionality. I'm not aware of any ongoing effort to address this. @Aneurysm9 do you have any suggestions here?
Adding an extra data point: I stumbled upon this issue when I configured a k8s livenessProbe pointing to the prometheusexporter endpoint.
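For reference, a minimal sketch of the kind of probe I mean, assuming the exporter listens on 8889 as in the config earlier in this thread (the values are illustrative, not my actual manifest):

# Illustrative only: a probe aimed straight at the prometheusexporter endpoint.
# The kubelet enforces timeoutSeconds and does not need the full body, so it can
# close the connection before the whole /metrics payload has been written,
# which the exporter then logs as "write: broken pipe".
livenessProbe:
  httpGet:
    path: /metrics
    port: 8889
    scheme: HTTP
  timeoutSeconds: 1
  periodSeconds: 10

Pointing the probe at the health_check extension on 13133 instead (as in the Deployment above) avoids pulling the whole metrics payload just to check liveness.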
Component(s)
exporter/prometheus
Describe the issue you're reporting
When I use the prometheusexporter, otelcol frequently reports errors such as the following:
2024-04-15T01:39:07.597Z error prometheusexporter@v0.97.0/log.go:23 error encoding and sending metric family: write tcp 172.31.204.123:8889->172.31.42.221:60282: write: broken pipe {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.97.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
go.opentelemetry.io/collector/config/confighttp@v0.97.0/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0/handler.go:225
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0/handler.go:83
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
go.opentelemetry.io/collector/config/confighttp@v0.97.0/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
net/http/server.go:3137
net/http.(*conn).serve
net/http/server.go:2039
Version: otelcol:0.97.0
Important configurations: