open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

error encoding and sending metric family: write tcp 172.31.204.123:8889->172.31.42.221:60282: write: broken pipe #32371

Open zhulei-pacvue opened 5 months ago

zhulei-pacvue commented 5 months ago

Component(s)

exporter/prometheus

Describe the issue you're reporting

When I use the Prometheus exporter, otelcol frequently reports errors like the following:

2024-04-15T01:39:07.597Z    error   prometheusexporter@v0.97.0/log.go:23    error encoding and sending metric family: write tcp 172.31.204.123:8889->172.31.42.221:60282: write: broken pipe
    {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.97.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
    github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
    github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
    net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
    go.opentelemetry.io/collector/config/confighttp@v0.97.0/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
    go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0/handler.go:225
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
    go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0/handler.go:83
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
    go.opentelemetry.io/collector/config/confighttp@v0.97.0/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
    net/http/server.go:3137
net/http.(*conn).serve
    net/http/server.go:2039

Version: otelcol:0.97.0

Important configuration:

exporters:
  otlp:
    endpoint: 'jaeger-collector:4317'
    tls:
      insecure: true  
  prometheus:
    endpoint: ${env:MY_POD_IP}:8889
    namespace:
    send_timestamps: false
    metric_expiration: 10m
    add_metric_suffixes: false   
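For context, "broken pipe" here means the scraping client (the peer at 172.31.42.221:60282 in the log above) closed the TCP connection before the exporter finished writing the /metrics response. A minimal sketch of what the scraping side typically looks like, assuming a plain static Prometheus scrape job (the job name, interval, timeout, and target below are illustrative, not taken from this report):

scrape_configs:
  - job_name: otel-collector            # hypothetical job name
    scrape_interval: 30s
    scrape_timeout: 10s                 # if this elapses mid-response, Prometheus
                                        # drops the connection and the exporter can
                                        # log "write: broken pipe"
    static_configs:
      - targets:
          - "172.31.204.123:8889"       # the collector's prometheus exporter endpoint
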
github-actions[bot] commented 5 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 commented 5 months ago

Hello @zhulei-pacvue, did this error happen on startup, or after the collector had been running for some time? Have you seen this happen repeatedly when running the collector, or was this only one time? Can you share more about what kind of environment the collector was running in?

zhulei-pacvue commented 5 months ago

@crobert-1 Thank you! This error often occurs after the collector has been running for some time. After the service is restarted, it runs normally again.

AndreasPetersen commented 4 months ago

We're experiencing the same issue:

error   prometheusexporter@v0.99.0/log.go:23    error encoding and sending metric family: write tcp 100.78.45.37:8889->100.78.6.13:46202: write: broken pipe
    {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.99.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
    github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
    github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
    net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
    go.opentelemetry.io/collector/config/confighttp@v0.99.0/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
    go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.50.0/handler.go:214
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
    go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.50.0/handler.go:72
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
    go.opentelemetry.io/collector/config/confighttp@v0.99.0/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
    net/http/server.go:3137
net/http.(*conn).serve
    net/http/server.go:2039

Usually there are about 10 or 20 such log lines that appear at about the same time. Some of them happen at the exact same time, while others are a few ms apart.

In the last week alone, this has happened about once or twice a day:

(screenshot: occurrences of the error over the last week)

We're running this on OpenShift using Kubernetes v1.26.13+8f85140.

The OpenTelemetry Collector runs as a sidecar container in the same pod as a Quarkus service, which sends metrics and traces to the Collector.

Here is the Kubernetes Deployment resource:

kind: Deployment
apiVersion: apps/v1
metadata:
  annotations:
    # ...
  name: service
  namespace: my-namespace
  labels:
    # ...
spec:
  replicas: 2
  selector:
    matchLabels:
      # ...
  template:
    metadata:
      labels:
        # ...
      annotations:
        # ...
        prometheus.io/path: /metrics
        prometheus.io/port: '8889'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: otel-config
          configMap:
            name: otel-config
            defaultMode: 420
        - name: opa-conf
          configMap:
            name: opa
            defaultMode: 420
      containers:
        - resources:
            limits:
              cpu: 500m
              memory: 1536Mi
            requests:
              cpu: 10m
              memory: 1536Mi
          readinessProbe:
            httpGet:
              path: /q/health/ready
              port: container-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 7
            successThreshold: 1
            failureThreshold: 7
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '5'
          name: service
          livenessProbe:
            httpGet:
              path: /q/health/live
              port: container-port
              scheme: HTTP
            initialDelaySeconds: 60
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            # ...
          ports:
            - name: container-port
              containerPort: 8080
              protocol: TCP
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          image: my-service
        - resources:
            limits:
              cpu: 10m
              memory: 150Mi
            requests:
              cpu: 10m
              memory: 150Mi
          readinessProbe:
            httpGet:
              path: /
              port: otel-health
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    while curl -X GET http://localhost:8080/q/health/live ; do
                    sleep 1; done
          name: otel-collector
          livenessProbe:
            httpGet:
              path: /
              port: otel-health
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: TZ
              value: Europe/Copenhagen
          ports:
            - name: otel-health
              containerPort: 13133
              protocol: TCP
            - name: otlp-grpc
              containerPort: 4317
              protocol: TCP
            - name: otlp-http
              containerPort: 4318
              protocol: TCP
            - name: jaeger-grpc
              containerPort: 14260
              protocol: TCP
            - name: jaeger-http
              containerPort: 14268
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: otel-config
              mountPath: /etc/config/otel
          terminationMessagePolicy: File
          # We use RedHat's ubi8-minimal as a base image, and install OpenTelemetry otelcol_VERSION_linux_amd64.rpm from https://github.com/open-telemetry/opentelemetry-collector-releases/releases onto it
          image: >-
            my-repo/opentelemetry-collector:0.99.0
          args:
            - otelcol
            - '--config=/etc/config/otel/otel-collector-config.yaml'
        - resources:
            limits:
              cpu: 50m
              memory: 150Mi
            requests:
              cpu: 10m
              memory: 150Mi
          readinessProbe:
            httpGet:
              path: /health?bundle=true
              port: opa-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 10
            periodSeconds: 7
            successThreshold: 1
            failureThreshold: 7
          terminationMessagePath: /dev/termination-log
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - >-
                    while curl -X GET http://localhost:8080/q/health/live ; do
                    sleep 1; done
          name: opa
          livenessProbe:
            httpGet:
              path: /health
              port: opa-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: TZ
              value: Europe/Copenhagen
          ports:
            - name: opa-port
              containerPort: 8181
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: opa-conf
              readOnly: true
              mountPath: /opa-conf
          terminationMessagePolicy: File
          image: >-
            opa:0.63.0
          args:
            - run
            - '--ignore=.*'
            - '--server'
            - '--log-level'
            - error
            - '--config-file'
            - /opa-conf/opa-conf.yaml
            - '--watch'
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
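
For reference, the prometheus.io/* annotations on the pod template above are the usual hook for an annotation-based Prometheus pod-discovery job. A generic sketch of such a job (this is the conventional relabeling pattern, not our cluster's actual scrape configuration) is shown below; whichever Prometheus instance runs it is presumably the remote peer (100.78.6.13 in the log above) that is closing the connection:

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # honor prometheus.io/path (here /metrics)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # honor prometheus.io/port (here 8889, the prometheus exporter port)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__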

The OpenTelemetry Collector config:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "opa"
          scrape_interval: 15s
          metrics_path: "/metrics"
          static_configs:
            - targets:
                - "0.0.0.0:8181"
  otlp:
    protocols:
      grpc:
        endpoint:
          0.0.0.0:4317
      http:
        endpoint:
          0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint:
          0.0.0.0:14260
      thrift_http:
        endpoint:
          0.0.0.0:14268
processors:
  batch:
    timeout: 1s
exporters:
  prometheus:
    add_metric_suffixes: false
    endpoint:
      0.0.0.0:8889
  otlp:
    endpoint: "ose-jaeger-collector-headless.ose-jaeger.svc.cluster.local:4317"
    tls:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
      insecure: false
      min_version: "1.2"
extensions:
  health_check:
    endpoint:
      0.0.0.0:13133
service:
  extensions: [ health_check ]
  pipelines:
    metrics:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ prometheus ]
    traces:
      receivers: [ otlp, jaeger ]
      processors: [ batch ]
      exporters: [ otlp ]
  telemetry:
    metrics:
      address: 0.0.0.0:8888

CPU and memory usage of the OpenTelemetry Collector shows nothing abnormal when the error occurs:

(screenshot: collector CPU and memory usage around the time of the errors)

L3o-pold commented 4 months ago

Same issue since we upgraded from 0.88.0 to 0.100.0.

OfekCyberX commented 3 months ago

Hi @crobert-1, is there any ongoing effort to fix or mitigate this issue? We have a deployment of otel-collector in k8s and we're facing the exact same issue, with the same stack trace from the prometheusexporter error:

github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.87.0/log.go:23

We also tried upgrading to the latest release, 0.97.0, but it didn't change anything.

We have the otel-collector running on a dedicated node without hard limits, only requests:

resources:
  requests:
    cpu: 32
    memory: 128Gi

Memory and CPU usage are both well within bounds: memory is at 12%, CPU is at 6%.

It seems to start happening when we increase the metrics ingest volume. We have clients sending metrics every minute, and the error started appearing when we increased the number of clients.

Is there any fix or mitigation for this issue, e.g. via a parameter or config change?

Thanks

crobert-1 commented 3 months ago

My apologies, I'm largely unfamiliar with this component and its functionality. I'm not aware of any ongoing effort to address this. @Aneurysm9 do you have any suggestions here?

tuxmachine commented 1 month ago

Adding an extra data point: I stumbled upon this issue when I configured a k8s livenessProbe pointing at the prometheusexporter endpoint.
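
For anyone hitting the same thing, a probe of that shape looks roughly like the sketch below (the port, path, and thresholds are illustrative, not copied from my manifest). When timeoutSeconds elapses before the exporter has finished writing the metrics payload, the kubelet tears down the connection and the exporter can report the aborted write as "write: broken pipe".

livenessProbe:
  httpGet:
    path: /metrics
    port: 8889            # the prometheus exporter endpoint
    scheme: HTTP
  timeoutSeconds: 1       # connection is closed once this elapses, even mid-response
  periodSeconds: 10
  failureThreshold: 3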