siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Pods get stuck in terminating state #9454

Open ruifung opened 2 weeks ago

ruifung commented 2 weeks ago

Bug Report

Pods randomly get stuck in the Terminating state on v1.8.0. It doesn't happen for every pod, but it happens often enough to build up a backlog of stuck pods.

Description

After updating my cluster to v1.8.0, I noticed that pods very often get stuck in Terminating status. While force deleting CAN work for some pods, pods with PVCs end up with volume attachments that never get cleaned up, and the safest way to recover is then to restart the node(s).
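For reference, the triage steps described above can be sketched with kubectl like this (pod and namespace names are placeholders, not from my cluster):

```shell
# List pods stuck in Terminating across all namespaces
# (STATUS is the 4th column of `kubectl get pods -A`)
kubectl get pods -A --no-headers | awk '$4 == "Terminating" {print $1 "/" $2}'

# Force delete one of them (placeholder names; for pods with PVCs
# this can leave the VolumeAttachment behind, as described above)
kubectl delete pod my-pod -n my-namespace --grace-period=0 --force

# Check for VolumeAttachments that were never cleaned up
kubectl get volumeattachments
```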

I suspect the issue might be related to https://github.com/containerd/containerd/issues/10727, since v1.8.0 uses containerd 2.0.0. Or maybe https://github.com/containerd/containerd/issues/10755.

I'm not sure, but reverting to v1.7.7 definitely resolves it: in a small experiment I left one node running on v1.8.0, and only that node continued to produce stuck pods.

Logs

Excerpt from CRI logs:

talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"9d9f825fe87a8120d6945989840468d0fc266858ab64734c9741743a8ccda9d6\" id:\"066707c9890d866a4e2ca5452018082e735eeb09b08e42ed82c4b1331d0782e2\" pid:150265 exited_at:{seconds:1728135418 nanos:494801087}","time":"2024-10-05T13:36:58.495188602Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"4e121708fc549304dad3a5b85a40a8d81afca2543b0d48b66e0b1efced902c3e\" id:\"db2adab2c1ef0cbb073ae975c9387373236cfc374d822041085fe1b707af74dd\" pid:150284 exited_at:{seconds:1728135418 nanos:495655288}","time":"2024-10-05T13:36:58.495934314Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"600e036a03c1a8a0665345fe62d51bfad21f08a7ca8c53f4bf28a30fe309c50d\" id:\"ce32d5078eceba0d526e0e7f08ed105c243c91a9d87efd9b7d3629cc03ab2412\" pid:150276 exited_at:{seconds:1728135418 nanos:495647473}","time":"2024-10-05T13:36:58.496048533Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = failed to stop sandbox \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\": failed to stop sandbox container \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\" in \"SANDBOX_READY\" state: context deadline exceeded","level":"error","msg":"StopPodSandbox for \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\" failed","time":"2024-10-05T13:36:59.964361741Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\"","time":"2024-10-05T13:37:00.395184636Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"8b22cf3f0ca6b98bcfba470db6caebbc34e71dc5114231e9c4e854b30a4a1cf2\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:00.395322440Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5","time":"2024-10-05T13:37:00.812334019Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35","time":"2024-10-05T13:37:01.036541035Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8","time":"2024-10-05T13:37:01.040340873Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed","time":"2024-10-05T13:37:01.062457311Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8","time":"2024-10-05T13:37:02.663179695Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed","time":"2024-10-05T13:37:02.681697009Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5","time":"2024-10-05T13:37:02.747565408Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35","time":"2024-10-05T13:37:03.080101532Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = an error occurs during waiting for container \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\" to be killed: wait container \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\": context canceled","level":"error","msg":"StopContainer for \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\" failed","time":"2024-10-05T13:37:03.971497412Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = an error occurs during waiting for container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\" to be killed: wait container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\": context canceled","level":"error","msg":"StopContainer for \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\" failed","time":"2024-10-05T13:37:03.971525806Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\" to be killed: wait container \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\": context deadline exceeded","level":"error","msg":"StopContainer for \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\" failed","time":"2024-10-05T13:37:04.971540932Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\" to be killed: wait container \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\": context deadline exceeded","level":"error","msg":"StopContainer for \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\" failed","time":"2024-10-05T13:37:04.971587842Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"82de1b7fdbdb08e91d6210188b9da8e8dde20101a1350f60f5e66f8c5c42e5ae\"","time":"2024-10-05T13:37:04.972100586Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"dbb1e042b85e88d74802dd3804a8b674be55cf64b487eb1f80756f066a8b86a9\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:04.972230305Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Kill container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\"","time":"2024-10-05T13:37:04.972741216Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = failed to stop sandbox \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\": failed to stop sandbox container \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\" in \"SANDBOX_READY\" state: context canceled","level":"error","msg":"StopPodSandbox for \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\" failed","time":"2024-10-05T13:37:06.988683839Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\"","time":"2024-10-05T13:37:07.422971089Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"cdcf58781bd5b6ee964489813432650e7e1cc1d55e71d18f21462fcf96bc681c\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:07.423119823Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"9d9f825fe87a8120d6945989840468d0fc266858ab64734c9741743a8ccda9d6\" id:\"7fbf1a93b093029c6830c2286242ca1e5c5731d88e4e9ec48a1430d386b6a9f6\" pid:150399 exited_at:{seconds:1728135428 nanos:443380713}","time":"2024-10-05T13:37:08.443902426Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"600e036a03c1a8a0665345fe62d51bfad21f08a7ca8c53f4bf28a30fe309c50d\" id:\"1c0c9ec0d47c307238eab20ba44225be040d9b9a2ea983c2a43784c72ac82579\" pid:150403 exited_at:{seconds:1728135428 nanos:450281678}","time":"2024-10-05T13:37:08.450609227Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"4e121708fc549304dad3a5b85a40a8d81afca2543b0d48b66e0b1efced902c3e\" id:\"b0e57a2e14131b1da3dc0f0afb03341d2c6169d477065fa99b1a0c3fcae91e09\" pid:150417 exited_at:{seconds:1728135428 nanos:452051427}","time":"2024-10-05T13:37:08.452379236Z"}
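Logs like the excerpt above can be pulled from the affected node with talosctl (a sketch; this assumes the workload containerd runs as the `cri` service, and the node name is taken from the excerpt):

```shell
# Stream workload-containerd (CRI) logs from the affected node
talosctl logs cri -n talos-worker-big-2.servers.internal

# Narrow to error-level entries while triaging
talosctl logs cri -n talos-worker-big-2.servers.internal | grep '"level":"error"'
```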

Environment

smira commented 1 week ago

Thanks for reporting this issue. If it's https://github.com/containerd/containerd/issues/10727, the fix is in containerd v2.0.0-rc.5, which will be included in Talos 1.8.1.