k0sproject / k0s

k0s - The Zero Friction Kubernetes
https://docs.k0sproject.io

k0s-pushgateway missing metrics from Control Planes after a while #4268

Closed Skaronator closed 1 week ago

Skaronator commented 5 months ago


Platform

fin-kubm-vm-01:~$ uname -srvmo; cat /etc/os-release || lsb_release -a

Linux 5.15.0-102-generic #112-Ubuntu SMP Tue Mar 5 16:50:32 UTC 2024 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Version

v1.29.2+k0s.0

Sysinfo

`k0s sysinfo`
$ sudo k0s sysinfo
Machine ID: "7f250da9878c8d1542136402b43ec42dd8d5b0a83de8889fac4f4cabb545b7cc" (from machine) (pass)
Total memory: 3.8 GiB (pass)
Disk space available for /var/lib/k0s: 33.5 GiB (pass)
Name resolution: localhost: [127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 5.15.0-102-generic (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: active (pass)
  Executable in PATH: modprobe: /usr/sbin/modprobe (pass)
  Executable in PATH: mount: /usr/bin/mount (pass)
  Executable in PATH: umount: /usr/bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (is a listed root controller) (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (is a listed root controller) (pass)
    cgroup controller "memory": available (is a listed root controller) (pass)
    cgroup controller "devices": available (device filters attachable) (pass)
    cgroup controller "freezer": available (cgroup.freeze exists) (pass)
    cgroup controller "pids": available (is a listed root controller) (pass)
    cgroup controller "hugetlb": available (is a listed root controller) (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: module (pass)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
        CONFIG_IP_VS_SH: Source hashing scheduling: module (pass)
        CONFIG_IP_VS_RR: Round-robin scheduling: module (pass)
        CONFIG_IP_VS_WRR: Weighted round-robin scheduling: module (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: module (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: built-in (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

I'm running a cluster of 3 control planes and 3 worker nodes. The cluster is already ancient at 2y82d, but it's working flawlessly.

Recently, we enabled System components monitoring by adding the --enable-metrics-scraper argument to our 3 control plane k0s controllers. We got the k0s-system namespace with the pushgateway, and metrics seem to reach our Prometheus.
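For reference, a rough sketch of how the flag gets applied on a controller (assuming a systemd-managed install created via k0s install; the unit name k0scontroller and the exact steps are assumptions and may differ per setup):

$ sudo k0s install controller --enable-metrics-scraper    # fresh install; on existing nodes the flag can be appended to the unit's ExecStart instead
$ sudo systemctl daemon-reload && sudo systemctl restart k0scontroller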

After taking a closer look, we realized that not all metrics are being received.

(screenshot)

Taking a closer look at what happened yesterday: (screenshot)

You can see that initially all etcd metrics were being received. Interestingly, node 3 was not restarted but has now started to send metrics again. I have no idea why; the worker nodes weren't restarted, so the pushgateway pod wasn't rescheduled.

When running curl against the pushgateway endpoint, I can see a different number of metrics for each machine. It seems like 03 returns all metrics, while 01 has ~90% missing (e.g. etcd is completely missing). 02 pushed zero metrics.

$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-01 | wc -l
638
$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-02 | wc -l
0
$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-03 | wc -l
7050
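To see when each job/instance group last pushed successfully, the pushgateway's own push_time_seconds metric can also be inspected (sketch):

$ curl -s http://localhost:9091/metrics | grep '^push_time_seconds'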

Steps to reproduce

  1. Deploy 3 CP & 3 Nodes
  2. (wait 2 years?)
  3. Enable the metrics scraper
  4. Verify that all metrics are being received (e.g. with the quick check sketched below).
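A simple per-node count against the pushgateway, same idea as the curl commands above (hostnames are from our setup and purely illustrative):

$ for n in 01 02 03; do echo -n "fin-kubm-vm-$n: "; curl -s http://localhost:9091/metrics | grep -c "fin-kubm-vm-$n"; done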

Expected behavior

I expect all metrics to reach the pushgateway.

Actual behavior

Metrics are only partially available

Screenshots and logs

I searched the logs but didn't find anything useful or related to this.

For example, node 01 has 4 log entries for today that contain metrics:

fin-kubm-vm-01 $ cat /var/log/syslog | grep metrics | grep -v Grafana
Apr 10 10:55:24 fin-kubm-vm-01 k0s[696]: time="2024-04-10 10:55:24" level=error msg="error sending POST request for job kube-scheduler: no endpoints available for service \"http:k0s-pushgateway:http\"" component=metrics metrics_job=kube-scheduler
Apr 10 10:55:24 fin-kubm-vm-01 k0s[696]: time="2024-04-10 10:55:24" level=error msg="error sending POST request for job kube-controller-manager: no endpoints available for service \"http:k0s-pushgateway:http\"" component=metrics metrics_job=kube-controller-manager
Apr 10 10:55:24 fin-kubm-vm-01 k0s[696]: time="2024-04-10 10:55:24" level=error msg="error sending POST request for job etcd: no endpoints available for service \"http:k0s-pushgateway:http\"" component=metrics metrics_job=etcd
Apr 10 10:55:25 fin-kubm-vm-01 k0s[696]: time="2024-04-10 10:55:25" level=info msg="W0410 10:55:25.148724     972 aggregator.go:166] failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable" component=kube-apiserver stream=stderr
Apr 10 10:55:25 fin-kubm-vm-01 k0s[696]: time="2024-04-10 10:55:25" level=info msg="W0410 10:55:25.602773     972 aggregator.go:166] failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable" component=kube-apiserver stream=stderr
Apr 10 10:55:26 fin-kubm-vm-01 k0s[696]: time="2024-04-10 10:55:26" level=info msg="W0410 10:55:26.006288     972 aggregator.go:166] failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable" component=kube-apiserver stream=stderr
Apr 10 10:55:26 fin-kubm-vm-01 k0s[696]: time="2024-04-10 10:55:26" level=info msg="E0410 10:55:26.446143     972 available_controller.go:460] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io \"v1beta1.metrics.k8s.io\": the object has been modified; please apply your changes to the latest version and try again" component=kube-apiserver stream=stderr
Apr 10 12:10:27 fin-kubm-vm-01 k0s[696]: time="2024-04-10 12:10:27" level=info msg="E0410 12:10:27.702374     972 available_controller.go:460] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.57.31:443/apis/metrics.k8s.io/v1beta1: Get \"https://10.101.57.31:443/apis/metrics.k8s.io/v1beta1\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" component=kube-apiserver stream=stderr
Apr 10 12:10:27 fin-kubm-vm-01 k0s[696]: time="2024-04-10 12:10:27" level=info msg="E0410 12:10:27.706869     972 controller.go:146] Error updating APIService \"v1beta1.metrics.k8s.io\" with err: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable" component=kube-apiserver stream=stderr
Apr 10 12:10:32 fin-kubm-vm-01 k0s[696]: time="2024-04-10 12:10:32" level=info msg="E0410 12:10:32.705754     972 available_controller.go:460] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.57.31:443/apis/metrics.k8s.io/v1beta1: Get \"https://10.101.57.31:443/apis/metrics.k8s.io/v1beta1\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" component=kube-apiserver stream=stderr
Apr 10 12:10:37 fin-kubm-vm-01 k0s[696]: time="2024-04-10 12:10:37" level=info msg="E0410 12:10:37.729808     972 available_controller.go:460] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.57.31:443/apis/metrics.k8s.io/v1beta1: Get \"https://10.101.57.31:443/apis/metrics.k8s.io/v1beta1\": http2: client connection lost" component=kube-apiserver stream=stderr
Apr 10 15:22:27 fin-kubm-vm-01 k0s[696]: time="2024-04-10 15:22:27" level=info msg="E0410 15:22:27.928776     972 controller.go:146] Error updating APIService \"v1beta1.metrics.k8s.io\" with err: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable" component=kube-apiserver stream=stderr
Apr 10 15:22:27 fin-kubm-vm-01 k0s[696]: time="2024-04-10 15:22:27" level=info msg="E0410 15:22:27.941563     972 available_controller.go:460] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io \"v1beta1.metrics.k8s.io\": the object has been modified; please apply your changes to the latest version and try again" component=kube-apiserver stream=stderr
Apr 10 15:22:37 fin-kubm-vm-01 k0s[696]: time="2024-04-10 15:22:37" level=info msg="E0410 15:22:37.941711     972 controller.go:146] Error updating APIService \"v1beta1.metrics.k8s.io\" with err: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable" component=kube-apiserver stream=stderr
Apr 10 15:22:37 fin-kubm-vm-01 k0s[696]: time="2024-04-10 15:22:37" level=info msg="E0410 15:22:37.957649     972 available_controller.go:460] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io \"v1beta1.metrics.k8s.io\": the object has been modified; please apply your changes to the latest version and try again" component=kube-apiserver stream=stderr
Apr 10 17:48:58 fin-kubm-vm-01 k0s[696]: time="2024-04-10 17:48:58" level=info msg="E0410 17:48:58.526585     972 controller.go:146] Error updating APIService \"v1beta1.metrics.k8s.io\" with err: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable" component=kube-apiserver stream=stderr
Apr 10 17:49:08 fin-kubm-vm-01 k0s[696]: time="2024-04-10 17:49:08" level=info msg="E0410 17:49:08.537255     972 controller.go:146] Error updating APIService \"v1beta1.metrics.k8s.io\" with err: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable" component=kube-apiserver stream=stderr
Apr 11 08:05:11 fin-kubm-vm-01 k0s[696]: time="2024-04-11 08:05:11" level=error msg="error sending POST request for job etcd: error trying to reach service: EOF" component=metrics metrics_job=etcd
Apr 11 08:05:11 fin-kubm-vm-01 k0s[696]: time="2024-04-11 08:05:11" level=error msg="error sending POST request for job kube-scheduler: error trying to reach service: EOF" component=metrics metrics_job=kube-scheduler
Apr 11 08:05:11 fin-kubm-vm-01 k0s[696]: time="2024-04-11 08:05:11" level=error msg="error sending POST request for job etcd: no endpoints available for service \"k0s-pushgateway\"" component=metrics metrics_job=etcd
Apr 11 08:05:11 fin-kubm-vm-01 k0s[696]: time="2024-04-11 08:05:11" level=error msg="error sending POST request for job kube-scheduler: no endpoints available for service \"k0s-pushgateway\"" component=metrics metrics_job=kube-scheduler
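For what it's worth, the "no endpoints available for service" errors suggest the k0s-pushgateway Service briefly had no ready pods; a quick check at such a moment would be something like:

$ kubectl -n k0s-system get endpoints k0s-pushgateway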

Additional context

No response

Skaronator commented 5 months ago

After restarting the pushgateway pod, I got more metrics again:

$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-03 | wc -l
7050
$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-02 | wc -l
2811
$ curl -s http://localhost:9091/metrics | grep fin-kubm-vm-01 | wc -l
2808
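In case it matters, restarting the pod can be done with something like the following (the deployment name is assumed from the Service name):

$ kubectl -n k0s-system rollout restart deployment k0s-pushgateway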

(screenshot)

jnummelin commented 5 months ago

Are you still able to grab the logs for the previous push-gateway pod? Since restarting it seemed to help, we'd be interested to see if there's anything in its logs to hint at what could've gone sideways.
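Something along these lines should work if the container restarted in place (pod name is a placeholder); if the pod itself was recreated, the old logs may already be gone:

$ kubectl -n k0s-system logs --previous <k0s-pushgateway-pod>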

Skaronator commented 5 months ago

Sorry, I forgot to mention the pushgateway logs because there's basically nothing in them. Here are the logs for the last 7 days:

ts=2024-04-11T08:05:10.820Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
ts=2024-04-11T08:05:10.820Z caller=main.go:140 level=info listen_address=:9091
ts=2024-04-11T08:05:10.818Z caller=main.go:90 level=debug msg="path prefix for internal routing" path=
ts=2024-04-11T08:05:10.818Z caller=main.go:89 level=debug msg="path prefix used externally" path=
ts=2024-04-11T08:05:10.818Z caller=main.go:88 level=debug msg="external URL" url=
ts=2024-04-11T08:05:10.818Z caller=main.go:87 level=info build_context="(go=go1.19.6, user=root@buildkitsandbox, date=20230217-09:16:39)"
ts=2024-04-11T08:05:10.818Z caller=main.go:86 level=info msg="starting pushgateway" version="(version=1.4.0, branch=HEAD, revision=b28bd0363ed3112fc0c1d39813cdc1c1d335bdf1)"
ts=2024-04-11T08:05:10.258Z caller=main.go:200 level=error msg="HTTP server stopped" err="accept tcp [::]:9091: use of closed network connection"
ts=2024-04-11T08:05:10.257Z caller=main.go:252 level=info msg="received SIGINT/SIGTERM; exiting gracefully..."
ts=2024-04-10T10:44:40.402Z caller=main.go:200 level=error msg="HTTP server stopped" err="accept tcp [::]:9091: use of closed network connection"
ts=2024-04-10T10:44:40.402Z caller=main.go:252 level=info msg="received SIGINT/SIGTERM; exiting gracefully..."
ts=2024-04-08T15:44:20.344Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.2.155:52800 err="unexpected EOF"
ts=2024-04-05T09:02:50.466Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
ts=2024-04-05T09:02:50.464Z caller=main.go:140 level=info listen_address=:9091
ts=2024-04-05T09:02:50.458Z caller=main.go:90 level=debug msg="path prefix for internal routing" path=
ts=2024-04-05T09:02:50.458Z caller=main.go:89 level=debug msg="path prefix used externally" path=
ts=2024-04-05T09:02:50.458Z caller=main.go:88 level=debug msg="external URL" url=
ts=2024-04-05T09:02:50.458Z caller=main.go:87 level=info build_context="(go=go1.19.6, user=root@buildkitsandbox, date=20230217-09:16:39)"
ts=2024-04-05T09:02:50.457Z caller=main.go:86 level=info msg="starting pushgateway" version="(version=1.4.0, branch=HEAD, revision=b28bd0363ed3112fc0c1d39813cdc1c1d335bdf1)"
ts=2024-04-05T09:02:49.747Z caller=main.go:200 level=error msg="HTTP server stopped" err="accept tcp [::]:9091: use of closed network connection"
ts=2024-04-05T09:02:49.747Z caller=main.go:252 level=info msg="received SIGINT/SIGTERM; exiting gracefully..."
ts=2024-04-05T08:57:00.094Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
ts=2024-04-05T08:57:00.094Z caller=main.go:140 level=info listen_address=:9091
ts=2024-04-05T08:57:00.092Z caller=main.go:90 level=debug msg="path prefix for internal routing" path=
ts=2024-04-05T08:57:00.092Z caller=main.go:89 level=debug msg="path prefix used externally" path=
ts=2024-04-05T08:57:00.092Z caller=main.go:88 level=debug msg="external URL" url=
ts=2024-04-05T08:57:00.092Z caller=main.go:87 level=info build_context="(go=go1.19.6, user=root@buildkitsandbox, date=20230217-09:16:39)"
ts=2024-04-05T08:57:00.092Z caller=main.go:86 level=info msg="starting pushgateway" version="(version=1.4.0, branch=HEAD, revision=b28bd0363ed3112fc0c1d39813cdc1c1d335bdf1)"
ts=2024-04-05T08:02:29.514Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.1.189:44278 err="unexpected EOF"
ts=2024-04-05T08:02:29.511Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.2.173:58952 err="unexpected EOF"
ts=2024-04-05T08:02:29.509Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.0.34:57258 err="unexpected EOF"
ts=2024-04-04T12:19:01.405Z caller=push.go:111 level=debug msg="failed to parse text" source=10.244.2.173:42550 err="unexpected EOF"

There are just 4 "failed to parse" errors, so I'd ignore them; everything else is just the normal startup log.

github-actions[bot] commented 4 months ago

The issue is marked as stale since no activity has been recorded in 30 days

Skaronator commented 4 months ago

This is not stale. I can give this another try with 1.30 in 2-3 weeks.

github-actions[bot] commented 3 months ago

The issue is marked as stale since no activity has been recorded in 30 days

Skaronator commented 3 months ago

We did a disaster recovery of our 2.5-year-old cluster. Well, actually we used fresh Ubuntu 22.04 VMs and redeployed everything, so no backup was involved. Since then this issue has stabilized a bit, but metrics from the 3rd node are still missing.

(screenshot)

The 3rd node was probably there at the beginning but dropped out again after a few hours. (We deployed the monitoring stack last.)

We are now using 1.30.0. The pushgateway logs show nothing wrong.

github-actions[bot] commented 2 months ago

The issue is marked as stale since no activity has been recorded in 30 days

github-actions[bot] commented 1 month ago

The issue is marked as stale since no activity has been recorded in 30 days

github-actions[bot] commented 1 week ago

The issue is marked as stale since no activity has been recorded in 30 days

Skaronator commented 1 week ago

It looks like this issue has resolved itself. It has now been running fine for almost 4 weeks:

(screenshot)

The only thing we might have changed around that time (I don't have exact dates, but it was within +/- 2 days) is our haproxy setup in front of k0s. We switched from a single VM to two VMs with keepalived providing a virtual IP. I don't think it should have impacted this, but who knows.