Open mikkelhegn opened 7 months ago
invalid metric type for ...
This line from the end of the log file is interesting and it seems to come from containerd metric collection logic: https://github.com/containerd/containerd/blob/2f807b606a58020174e1aecfdec63810201e05bc/core/metrics/cgroups/v2/metrics.go#L151
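For context, that check in containerd's cgroups v2 metrics collector is essentially a type assertion on the stats payload it gets back from the shim. A hedged, self-contained sketch of the pattern (type names simplified here, not containerd's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the stats payload types; the real ones are
// protobuf messages in containerd's cgroups v1/v2 APIs.
type v1Metrics struct{}
type v2Metrics struct{}

// collectV2 mirrors the shape of the check in the v2 collector: it only
// accepts v2 metrics and rejects anything else with the error seen in the log.
func collectV2(stats interface{}) error {
	s, ok := stats.(*v2Metrics)
	if !ok {
		return errors.New("invalid metric type")
	}
	_ = s // the real collector reads CPU/memory fields from s here
	return nil
}

func main() {
	// A v1 payload on a v2 node reproduces the error from the log.
	fmt.Println(collectV2(&v1Metrics{})) // invalid metric type
	fmt.Println(collectV2(&v2Metrics{})) // <nil>
}
```

So if the shim hands back a v1 metrics message on a v2 node, this is exactly where it would blow up.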
Could you please verify that your nodes are on cgroups v2 by running mount | grep cgroup?
From the node where the app is restarting:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/79f773f2e2b1659aa4f3d7bb1d81a48666d3add27fce0b1b5d805a85d1021191/rootfs/sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)
From the node where the app is not restarting:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7f139ffd06a9c6a7d95d1e6c019b537d90232ee53b0d4f8417598669ad76db9d/rootfs/sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)
I am thinking there might be some metrics that we report at the shim level that are not recognized by containerd.
I just deployed another application to the cluster (to rule out the simple sample app).
This one also restarts, but only on one of the nodes, which the other app was restarting on. I've also noted that the restarts have not happened for the last 11 hours on the node where the old app used to restart, and this node does not restart the new app. So I don't know if something "fell into place" on that node. (Let me know if this becomes too confusing to keep track of.)
However, here are the current restart stats for the two apps across the four nodes:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
simple-spinapp-56687588d9-7d8zf 1/1 Running 260 (11h ago) 38h 10.244.0.35 aks-nodepool1-31969921-vmss000001 <none> <none>
simple-spinapp-56687588d9-s5g95 1/1 Running 364 (21m ago) 38h 10.244.3.118 aks-nodepool2-19133108-vmss000001 <none> <none>
simple-spinapp-56687588d9-xffsh 1/1 Running 0 38h 10.244.2.8 aks-nodepool2-19133108-vmss000000 <none> <none>
simple-spinapp-56687588d9-xh8l7 1/1 Running 0 38h 10.244.1.14 aks-nodepool1-31969921-vmss000000 <none> <none>
spinkube-live-mikkel-75cc76c678-4r7rl 0/1 CrashLoopBackOff 7 (2m25s ago) 38m 10.244.3.123 aks-nodepool2-19133108-vmss000001 <none> <none>
spinkube-live-mikkel-75cc76c678-f7476 1/1 Running 0 38m 10.244.0.36 aks-nodepool1-31969921-vmss000001 <none> <none>
spinkube-live-mikkel-75cc76c678-fm8dd 1/1 Running 0 38m 10.244.1.15 aks-nodepool1-31969921-vmss000000 <none> <none>
spinkube-live-mikkel-75cc76c678-j5pvj 1/1 Running 0 38m 10.244.2.9 aks-nodepool2-19133108-vmss000000 <none> <none>
Without my having done anything, the other nodes, which previously also restarted the spin sample, have started restarting both the simple sample and the new app:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
simple-spinapp-56687588d9-7d8zf 1/1 Running 307 (6m28s ago) 43h 10.244.0.130 aks-nodepool1-31969921-vmss000001 <none> <none>
simple-spinapp-56687588d9-s5g95 0/1 CrashLoopBackOff 407 (80s ago) 43h 10.244.3.208 aks-nodepool2-19133108-vmss000001 <none> <none>
simple-spinapp-56687588d9-xffsh 1/1 Running 0 43h 10.244.2.8 aks-nodepool2-19133108-vmss000000 <none> <none>
simple-spinapp-56687588d9-xh8l7 1/1 Running 0 43h 10.244.1.14 aks-nodepool1-31969921-vmss000000 <none> <none>
spinkube-live-mikkel-75cc76c678-4r7rl 1/1 Running 51 (39s ago) 5h19m 10.244.3.209 aks-nodepool2-19133108-vmss000001 <none> <none>
spinkube-live-mikkel-75cc76c678-f7476 0/1 CrashLoopBackOff 47 (4m48s ago) 5h19m 10.244.0.131 aks-nodepool1-31969921-vmss000001 <none> <none>
spinkube-live-mikkel-75cc76c678-fm8dd 1/1 Running 0 5h19m 10.244.1.15 aks-nodepool1-31969921-vmss000000 <none> <none>
spinkube-live-mikkel-75cc76c678-j5pvj 1/1 Running 0 5h19m 10.244.2.9 aks-nodepool2-19133108-vmss000000 <none> <none>
I just added a third node pool using Mariner / Azure Linux; both nodes in that node pool are restarting the apps.
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
simple-spinapp-56687588d9-7d8zf 1/1 Running 307 (23m ago) 43h 10.244.0.130 aks-nodepool1-31969921-vmss000001 <none> <none>
simple-spinapp-56687588d9-jwp9b 1/1 Running 1 (2m1s ago) 2m34s 10.244.4.5 aks-nodepool3-26262681-vmss000000 <none> <none>
simple-spinapp-56687588d9-s5g95 0/1 CrashLoopBackOff 409 (2m6s ago) 43h 10.244.3.215 aks-nodepool2-19133108-vmss000001 <none> <none>
simple-spinapp-56687588d9-wdt9b 1/1 Running 1 (2m1s ago) 2m34s 10.244.5.6 aks-nodepool3-26262681-vmss000001 <none> <none>
simple-spinapp-56687588d9-xffsh 1/1 Running 0 43h 10.244.2.8 aks-nodepool2-19133108-vmss000000 <none> <none>
simple-spinapp-56687588d9-xh8l7 1/1 Running 0 43h 10.244.1.14 aks-nodepool1-31969921-vmss000000 <none> <none>
spinkube-live-mikkel-75cc76c678-4r7rl 1/1 Running 55 (4m46s ago) 5h37m 10.244.3.214 aks-nodepool2-19133108-vmss000001 <none> <none>
spinkube-live-mikkel-75cc76c678-dx8zv 1/1 Running 2 (50s ago) 2m23s 10.244.5.7 aks-nodepool3-26262681-vmss000001 <none> <none>
spinkube-live-mikkel-75cc76c678-f7476 0/1 CrashLoopBackOff 50 (3m19s ago) 5h37m 10.244.0.136 aks-nodepool1-31969921-vmss000001 <none> <none>
spinkube-live-mikkel-75cc76c678-fm8dd 1/1 Running 0 5h37m 10.244.1.15 aks-nodepool1-31969921-vmss000000 <none> <none>
spinkube-live-mikkel-75cc76c678-j5pvj 1/1 Running 0 5h37m 10.244.2.9 aks-nodepool2-19133108-vmss000000 <none> <none>
spinkube-live-mikkel-75cc76c678-q88q5 1/1 Running 1 (50s ago) 2m23s 10.244.4.6 aks-nodepool3-26262681-vmss000000 <none> <none>
I checked on one of the nodes where this issue is happening and observed that it is returning cgroups v1 metrics:
root@aks-nodepool1-31969921-vmss000001:/# /tmp/ttrpc-client /run/containerd/f1e518aba0e9dccf.sock 94c1433b03b96c984b1bf15773e409682ae64b6d697f7392c1bb37256f492bca
type_url:"io.containerd.cgroups.v1.Metrics" value:"\x12\x03\x08\xee\x01\x1a\x16\n\x12\x08\xfdܙ\xae\n\x10\xe4\xfd\xa0\xf9\x03\x18\x98\xdf\xf8\xb4\x06\x12\x00\"\t\x8a\x02\x06\x10\x80\xe0\x9c\xff\x02"
I am trying to find out why that is the case.
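The type_url on that returned Any is what containerd keys off when it decodes the shim's stats. A minimal hedged helper that classifies the payload by that URL (the two URL strings are the ones visible in these logs and in containerd's API; the function itself is just illustration):

```go
package main

import "fmt"

// metricsVersion classifies a protobuf Any type_url from a shim's
// Stats response. The v1/v2 URLs match the strings seen in the logs.
func metricsVersion(typeURL string) string {
	switch typeURL {
	case "io.containerd.cgroups.v1.Metrics":
		return "cgroups v1"
	case "io.containerd.cgroups.v2.Metrics":
		return "cgroups v2"
	}
	return "unknown"
}

func main() {
	// The ttrpc response above carries the v1 URL, which a v2-only
	// collector will reject.
	fmt.Println(metricsVersion("io.containerd.cgroups.v1.Metrics")) // cgroups v1
}
```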
Another data point, though I am unable to make anything out of it right now (I was expecting crictl to fail as well):
root@aks-nodepool1-31969921-vmss000001:/# crictl ps | grep spin
01748f8974627 bca3c4d784227 59 minutes ago Running simple-spinapp 323 94c1433b03b96 simple-spinapp-56687588d9-7d8zf
f4ff19248a4ef f99fcd5dfb557 About an hour ago Running spinkube-live-mikkel 62 b0540e08d14db spinkube-live-mikkel-75cc76c678-f7476
crictl stats f4ff19248a4ef
CONTAINER NAME CPU % MEM DISK INODES
f4ff19248a4ef spinkube-live-mikkel 0.00 33.43MB 7.881MB 23
crictl stats 01748f8974627
CONTAINER NAME CPU % MEM DISK INODES
01748f8974627 simple-spinapp 0.00 10.28MB 90.11kB 26
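One possible reason crictl still works: as far as I can tell, containerd's CRI stats path decodes the returned metrics by type URL and accepts both the v1 and v2 message types, unlike the v2-only Prometheus collector linked above. A hedged sketch of that tolerant pattern (types and field simplified, not containerd's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the two protobuf metrics messages.
type v1Metrics struct{ MemoryMB float64 }
type v2Metrics struct{ MemoryMB float64 }

// memoryMB mirrors a CRI-style reader that tolerates either cgroups
// version, which would explain crictl stats succeeding while the
// v2-only collector fails.
func memoryMB(stats interface{}) (float64, error) {
	switch s := stats.(type) {
	case *v1Metrics:
		return s.MemoryMB, nil
	case *v2Metrics:
		return s.MemoryMB, nil
	}
	return 0, errors.New("unexpected metrics type")
}

func main() {
	// A v1 payload is still readable on this path.
	mb, _ := memoryMB(&v1Metrics{MemoryMB: 10.28})
	fmt.Println(mb) // 10.28
}
```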
I added some debug logging to the shim and ran it on the node having crashes:
Apr 23 17:05:22 aks-nodepool1-31969921-vmss000003 containerd[212274]: time="2024-04-23T17:05:22.943002063Z" level=info msg="stats: StatsRequest { id: "2008530c02eec982e25a6ebe9f5e005b0704b4ed8546f7c7b5f3e104f9c17f40", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }"
Apr 23 17:05:22 aks-nodepool1-31969921-vmss000003 containerd[212274]: time="2024-04-23T17:05:22.945109028Z" level=info msg="stats resp: StatsResponse { stats: MessageField(Some(Any { type_url: "io.containerd.cgroups.v1.Metrics", value: [18, 2, 8, 5, 26, 16, 10, 12, 8, 159, 181, 14, 16, 154, 197, 7, 24, 133, 240, 6, 18, 0, 34, 8, 138, 2, 5, 16, 128, 128, 169, 16], special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } })), special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }"
So it seems like it is somehow receiving cgroup v1 metrics.
I deployed the simple-app sample to an AKS cluster, and am observing that two of four pods are continuously restarting.
The cluster consists of four nodes (two are amd64, two are arm64). The restarts are happening on one node of each architecture. All nodes run Ubuntu 22.04.
I've obtained the containerd logs (using journalctl -u containerd) and attached them here. Let me know what else is needed to troubleshoot this problem. I still have the cluster running and the app deployed.
containerd-restarting.log
containerd-not-restarting.log