siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

CRI fails to start #9496

Open duhruh opened 1 month ago

duhruh commented 1 month ago

Bug Report

One of my nodes just started failing with the following error:

 {"error":"failed to recover state: failed to get metadata for stored sandbox \"c72f1f0286499086549d0a4b51e91fe929190335fbdb9d98119c654aa42d2f0e\": not found","level":"fatal","msg":"Failed to run CRI service","time":"2024-10-12T23:27:38.449872185Z"}

This is preventing CRI from starting.

Is there some way to force it to start?

Environment

Client:
        Tag:         v1.7.2
        SHA:         f876025b
        Built:
        Go version:  go1.22.3
        OS/Arch:     windows/amd64
Server:
        NODE:        my.node
        Tag:         v1.8.0
        SHA:         5cc935f7
        Built:
        Go version:  go1.22.7
        OS/Arch:     linux/amd64
        Enabled:     RBAC

duhruh commented 1 month ago

Okay, I found a simple workaround:

talosctl -n my.node reset --system-labels-to-wipe EPHEMERAL --reboot
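
For anyone who wants to confirm they are hitting the same failure before wiping anything, here is a rough sketch that wraps the command above in a check. The function names are made up for illustration, and it assumes the error shows up in the output of `talosctl logs cri` with the same wording as in this report. Treat it as a starting point, not a vetted procedure.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of talosctl.
NODE="${NODE:-my.node}"

# Returns 0 when stdin contains the sandbox recovery error quoted in this report.
has_sandbox_recovery_error() {
  grep -q 'failed to recover state: failed to get metadata for stored sandbox'
}

# Runs the reset workaround only when the node's CRI logs show that error.
# Assumption: the fatal message is visible via `talosctl logs cri`.
reset_if_cri_state_corrupt() {
  if talosctl -n "$NODE" logs cri | has_sandbox_recovery_error; then
    talosctl -n "$NODE" reset --system-labels-to-wipe EPHEMERAL --reboot
  else
    echo "sandbox recovery error not found in cri logs on $NODE; not resetting" >&2
    return 1
  fi
}
```

Source the file, then run `NODE=my.node reset_if_cri_state_corrupt`. Keep in mind that wiping EPHEMERAL clears the node's `/var`, so cached images and any node-local pod data on it are gone after the reboot.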

Maybe y'all can add it to the troubleshooting docs for others who experience the same error?

I'll leave this open for comments, but feel free to close it if you feel no further investigation is necessary.

smira commented 1 month ago

This is a containerd state corruption issue. We're looking into providing a better way to wipe the state, but the workaround above is good for now.