siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0
6.77k stars 541 forks source link

`machined.run` function blocks forever if `c.V1Alpha2().Run` fails without ever running controllers. #8263

Open DmitriyMV opened 9 months ago

DmitriyMV commented 9 months ago

The main culprit is here:

https://github.com/siderolabs/talos/blob/474fa0480dd68d112a608548e4d0a0c4efa39e20/internal/app/machined/main.go#L208-L214

If it exits too soon, the

https://github.com/siderolabs/talos/blob/474fa0480dd68d112a608548e4d0a0c4efa39e20/internal/app/machined/main.go#L227-L233

will block forever, because EnforceKSPPRequirements (specifically runtime.KernelParamsSetCondition) depends on runtime running. But even if we somehow exit from this function we will forever block here https://github.com/siderolabs/talos/blob/474fa0480dd68d112a608548e4d0a0c4efa39e20/internal/app/machined/main.go#L197-L199

Whats is interesting here, is that passed ctx will essentially act as context.Background() because deferred cancelation will happen AFTER we finish with system.Singleton.Shutdown(...). And those, if runtime is not running, Shutdown will block forever too.

I try to fix this in #8256, but I don't think we should pass canceled context to the Shutdown(...).

Instead we create a new context with timeout of 60 seconds (stopServices try to wait for about 30 seconds, so we make the overall context twice as big). But I'm unsure if it's proper fix or not.

smira commented 9 months ago

This code is ugly, as it combines older sequence-based initialization and newer controller-based approach.

But if we want to fix it (which I believe we don't need to for real), we just need to wait with some timeout and fail the sequencer. This works well.

If we need this for debugging purposes, we can add a log line.

DmitriyMV commented 9 months ago

The thing is, under current implementation you will never see fatal controller runtime error because receiving from the channel happens after running from the sequencer, which, as it turns out, never happens. We can add the log info about controller runtime failing, but the machine will be stuck anyway.

we just need to wait with some timeout and fail the sequencer. This works well.

We would also need to add timeout for Shutdown for the reasons I outlined above.

which I believe we don't need to for real

c.V1Alpha2().Run doesn't do much - it only registers controllers and runs them. But if we choose to ignore this, we should at least document that if controller registration fails, the machine will be stuck (unlike reboot which is what most users is expecting probably?) forever.

DmitriyMV commented 9 months ago

Created #8264 as a quick fix.

smira commented 9 months ago

c.V1Alpha2().Run doesn't do much - it only registers controllers and runs them. But if we choose to ignore this, we should at least document that if controller registration fails, the machine will be stuck (unlike reboot which is what most users is expecting probably?) forever.

that's my point - it doesn't happen unless it's a developer error, so this is a non-issue for our users, but rather improvement for developers. this is not bad to be fixed, but I would rather try to find the least intrusive way to do so, and in the end this code will most probably go away.