Skaronator closed this issue 1 week ago.
After restarting the pushgateway pod, I got more metrics again:
$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-03 | wc -l
7050
$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-02 | wc -l
2811
$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-01 | wc -l
2808
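For completeness, the pushgateway also exposes a `push_time_seconds` series for every push group, which shows when each node last pushed. A quick check, assuming the same localhost:9091 access as above:

```
# push_time_seconds is written by the pushgateway itself for every push group,
# so stale timestamps point at nodes that have stopped pushing
curl -s http://localhost:9091/metrics | grep '^push_time_seconds'
```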
Are you still able to grab the logs for the previous pushgateway pod? As restarting it seemed to help, we'd be interested to see if there's anything in its logs that hints at what could have gone sideways.
Sorry, I forgot to mention the pushgateway logs because there are basically none. Here are the logs from the last 7 days:
ts=2024-04-11T08:05:10.820Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
ts=2024-04-11T08:05:10.820Z caller=main.go:140 level=info listen_address=:9091
ts=2024-04-11T08:05:10.818Z caller=main.go:90 level=debug msg="path prefix for internal routing" path=
ts=2024-04-11T08:05:10.818Z caller=main.go:89 level=debug msg="path prefix used externally" path=
ts=2024-04-11T08:05:10.818Z caller=main.go:88 level=debug msg="external URL" url=
ts=2024-04-11T08:05:10.818Z caller=main.go:87 level=info build_context="(go=go1.19.6, user=root@buildkitsandbox, date=20230217-09:16:39)"
ts=2024-04-11T08:05:10.818Z caller=main.go:86 level=info msg="starting pushgateway" version="(version=1.4.0, branch=HEAD, revision=b28bd0363ed3112fc0c1d39813cdc1c1d335bdf1)"
ts=2024-04-11T08:05:10.258Z caller=main.go:200 level=error msg="HTTP server stopped" err="accept tcp [::]:9091: use of closed network connection"
ts=2024-04-11T08:05:10.257Z caller=main.go:252 level=info msg="received SIGINT/SIGTERM; exiting gracefully..."
ts=2024-04-10T10:44:40.402Z caller=main.go:200 level=error msg="HTTP server stopped" err="accept tcp [::]:9091: use of closed network connection"
ts=2024-04-10T10:44:40.402Z caller=main.go:252 level=info msg="received SIGINT/SIGTERM; exiting gracefully..."
ts=2024-04-08T15:44:20.344Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.2.155:52800 err="unexpected EOF"
ts=2024-04-05T09:02:50.466Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
ts=2024-04-05T09:02:50.464Z caller=main.go:140 level=info listen_address=:9091
ts=2024-04-05T09:02:50.458Z caller=main.go:90 level=debug msg="path prefix for internal routing" path=
ts=2024-04-05T09:02:50.458Z caller=main.go:89 level=debug msg="path prefix used externally" path=
ts=2024-04-05T09:02:50.458Z caller=main.go:88 level=debug msg="external URL" url=
ts=2024-04-05T09:02:50.458Z caller=main.go:87 level=info build_context="(go=go1.19.6, user=root@buildkitsandbox, date=20230217-09:16:39)"
ts=2024-04-05T09:02:50.457Z caller=main.go:86 level=info msg="starting pushgateway" version="(version=1.4.0, branch=HEAD, revision=b28bd0363ed3112fc0c1d39813cdc1c1d335bdf1)"
ts=2024-04-05T09:02:49.747Z caller=main.go:200 level=error msg="HTTP server stopped" err="accept tcp [::]:9091: use of closed network connection"
ts=2024-04-05T09:02:49.747Z caller=main.go:252 level=info msg="received SIGINT/SIGTERM; exiting gracefully..."
ts=2024-04-05T08:57:00.094Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
ts=2024-04-05T08:57:00.094Z caller=main.go:140 level=info listen_address=:9091
ts=2024-04-05T08:57:00.092Z caller=main.go:90 level=debug msg="path prefix for internal routing" path=
ts=2024-04-05T08:57:00.092Z caller=main.go:89 level=debug msg="path prefix used externally" path=
ts=2024-04-05T08:57:00.092Z caller=main.go:88 level=debug msg="external URL" url=
ts=2024-04-05T08:57:00.092Z caller=main.go:87 level=info build_context="(go=go1.19.6, user=root@buildkitsandbox, date=20230217-09:16:39)"
ts=2024-04-05T08:57:00.092Z caller=main.go:86 level=info msg="starting pushgateway" version="(version=1.4.0, branch=HEAD, revision=b28bd0363ed3112fc0c1d39813cdc1c1d335bdf1)"
ts=2024-04-05T08:02:29.514Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.1.189:44278 err="unexpected EOF"
ts=2024-04-05T08:02:29.511Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.2.173:58952 err="unexpected EOF"
ts=2024-04-05T08:02:29.509Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.0.34:57258 err="unexpected EOF"
ts=2024-04-04T12:19:01.405Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.2.173:42550 err="unexpected EOF"
There are just four "failed to parse" errors, so I'd ignore them; everything else is just the normal startup log.
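In case it helps with reproducing this, a rough sketch of how such logs can be pulled and filtered, assuming the pushgateway runs as a Deployment in the k0s-system namespace (the workload name below is a placeholder, not necessarily the real one):

```
# find the actual pushgateway workload name first
kubectl -n k0s-system get deploy,pods

# dump the last 7 days of logs and drop the routine startup lines
# ("pushgateway" is a placeholder name here)
kubectl -n k0s-system logs deploy/pushgateway --since=168h | grep -v 'level=info'
```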
The issue is marked as stale since no activity has been recorded in 30 days
This is not stale. I can give this another try with 1.30 in 2-3 weeks.
The issue is marked as stale since no activity has been recorded in 30 days
We did a disaster recovery of our 2.5-year-old cluster. Well, actually we used fresh Ubuntu 22.04 VMs and deployed everything new, so no backup was involved. Since then this issue has stabilized a bit, but the 3rd node is still missing.
The 3rd node was probably there at the beginning, but then dropped out again after a few hours. (We deployed the monitoring stack last.)
We are now using 1.30.0. The pushgateway shows nothing wrong.
The issue is marked as stale since no activity has been recorded in 30 days
The issue is marked as stale since no activity has been recorded in 30 days
The issue is marked as stale since no activity has been recorded in 30 days
It looks like this issue has resolved itself. It has now been running fine for almost 4 weeks:
The only thing we might have changed (I don't have exact dates, but it was around that time ±2 days) is the haproxy setup we use in front of k0s. We switched from a single VM to two VMs with keepalived for a virtual IP. I don't think it should have impacted this, but who knows.
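For context, a rough sketch of how the new setup can be sanity-checked (the virtual IP and port below are placeholders, not our real values):

```
# hypothetical virtual IP managed by keepalived in front of the two haproxy VMs
VIP=192.0.2.10

# on each haproxy VM: shows which one currently holds the VIP
ip -4 addr show | grep -w "$VIP"

# from any node: confirm the apiserver still answers through the VIP
curl -sk "https://$VIP:6443/version"
```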
Platform
Version
v1.29.2+k0s.0
Sysinfo
`k0s sysinfo`
What happened?
I'm running a cluster of 3 control planes and 3 worker nodes. The cluster is already ancient at 2y82d, but it's working flawlessly.
Recently, we enabled System components monitoring by adding the `--enable-metrics-scraper` argument to our 3 control plane k0s controllers. We got the k0s-system namespace with the push gateway, and metrics seem to reach our Prometheus. After taking a closer look, we realized that not all metrics are being received.
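For reference, a minimal sketch of how the flag ends up on a controller; the exact steps depend on how k0s was installed, and the unit name below is the usual default rather than something specific to our setup:

```
# inspect the current controller invocation (unit name may differ per install)
sudo systemctl cat k0scontroller

# after adding --enable-metrics-scraper to the ExecStart command line:
sudo systemctl daemon-reload
sudo systemctl restart k0scontroller
```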
Taking a closer look at what happened yesterday:
You can see that initially all etcd metrics were being received. Interestingly, node3 was not restarted but has now started to send metrics again. I have no idea why; the worker nodes didn't get restarted, so the pushgateway pod didn't get rescheduled.
When curling the pushgateway endpoint, I can see a different number of metrics for each machine. It seems like 03 returns all metrics, while 01 has 90% of them missing (e.g. etcd is completely missing). 02 pushed zero metrics.
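The same check in loop form, assuming the pushgateway is reachable on localhost:9091 and the pushed metrics carry the node hostnames as shown earlier:

```
# count exposed metric lines per node; wildly different counts indicate partial pushes
for node in fin-kubm-vm-01 fin-kubm-vm-02 fin-kubm-vm-03; do
  echo -n "$node: "
  curl -s http://localhost:9091/metrics | grep -c "$node"
done
```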
Steps to reproduce
Expected behavior
I expect that all metrics reach the push gateway
Actual behavior
Metrics are only partially available
Screenshots and logs
I searched the logs but didn't find anything useful or related to this.
For example, node 01 has 4 log entries for today that contain metrics:
Additional context
No response