prometheus / pushgateway

Push acceptor for ephemeral and batch jobs.
Apache License 2.0

Clients pushing metrics to pushgateway are receiving intermittent timeouts #643

Closed navathag closed 2 months ago

navathag commented 4 months ago

We have a single Pushgateway instance deployed as a Kubernetes pod, and hundreds of clients external to this Kubernetes cluster push their certificate-related metrics to it at a regular interval. Clients intermittently hit read timeouts when pushing their metrics, even though we see no resource starvation on the Pushgateway pod. We also run a cron job every hour that deletes stale metrics on the Pushgateway, checking back over the last 30 minutes. How can we resolve this?

In the Pushgateway log we observe entries like the following:

```
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:47092","ts":"2024-05-22T21:16:17.351Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:43806","ts":"2024-05-22T21:16:17.352Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:48326","ts":"2024-05-22T21:16:17.352Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:43822","ts":"2024-05-22T21:16:17.352Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:49886","ts":"2024-05-22T21:16:17.353Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:43918","ts":"2024-05-22T21:16:17.353Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:48168","ts":"2024-05-22T21:16:17.354Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:43978","ts":"2024-05-22T21:16:17.354Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:43904","ts":"2024-05-22T21:16:17.354Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:48104","ts":"2024-05-22T21:16:17.354Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:47068","ts":"2024-05-22T21:16:17.354Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:44094","ts":"2024-05-22T21:16:17.355Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:48142","ts":"2024-05-22T21:16:17.355Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:48146","ts":"2024-05-22T21:16:17.355Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:48320","ts":"2024-05-22T21:16:17.355Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:44044","ts":"2024-05-22T21:16:17.355Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:43988","ts":"2024-05-22T21:16:17.356Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:48272","ts":"2024-05-22T21:16:17.360Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:43744","ts":"2024-05-22T21:16:17.360Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:44110","ts":"2024-05-22T21:16:17.360Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:43872","ts":"2024-05-22T21:16:17.360Z"}
{"caller":"push.go:106","err":"unexpected EOF","level":"debug","msg":"failed to parse text","source":"istio-ingress-pod-ip:48158","ts":"2024-05-22T21:16:17.360Z"}
```

beorn7 commented 4 months ago

You could do some profiling on the PGW to see where the bottleneck is.
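For reference, one way to capture a CPU profile over HTTP is sketched below. It assumes the Pushgateway binary exposes Go's `net/http/pprof` handlers under `/debug/pprof`; whether that endpoint is available depends on the version/build, so check for a 404 first. The address is a placeholder (e.g. after port-forwarding the pod).

```go
// Sketch: fetch a 30s CPU profile from the Pushgateway over HTTP, assuming a
// /debug/pprof endpoint is exposed (verify for your build; absent => 404).
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Placeholder address; e.g. kubectl port-forward the pod to localhost:9091 first.
	resp, err := http.Get("http://localhost:9091/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status %s (pprof endpoint may not be exposed)", resp.Status)
	}

	out, err := os.Create("pushgateway-cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote pushgateway-cpu.pprof; inspect with `go tool pprof pushgateway-cpu.pprof`")
}
```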

But generally, the PGW isn't designed for high load, so I wouldn't be surprised if the problem is literally "by design".

beorn7 commented 2 months ago

Closing for lack of follow-up.