siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

infinite boot loop #9720

Open smst329 opened 1 week ago

smst329 commented 1 week ago

Bug Report

The Talos ISO reboots in an infinite loop and never stops.

https://github.com/siderolabs/talos/issues/9702

^ In that bug report, they kept saying I needed to wipe the disk/previous install.

Funny thing happened today: a new hard drive came in the mail, and there is still an infinite boot loop. I didn't know hard drives came pre-installed with Talos.

I'm just reporting the bug, in case it affects any potential or current customers.

Description

Logs

Environment

smira commented 1 week ago

Without the logs, it's impossible to tell. If you have i915 by chance, it might be fixed by adding the i915-ucode system extension. (This is going to be fixed in 1.9.)

erickuiper commented 1 week ago

I just ran into this same issue again after destroying and recreating a cluster running on 1.8.3 while having the extension enabled on the node.

Will this be resolved in the mentioned 1.9 fix?

smst329 commented 1 week ago

If you have i915, maybe. If you don't, then probably not. I am not sure they fully understand all the causes of their boot loops. I don't have an i915, so their supposition is wrong again.

I'd like them to reconsider infinite boot loops as a strategy for responding to a problem. What conditions does a reboot change, such that things work on the 13th reboot when they didn't on the 12th? Do 12 reboots clear a previous install? Do 12 reboots cause a USB stick to fly out of the machine? Do 12 reboots fix the DHCP server?

They don't have to agree, but I think infinite boot loops are bad design. There are other kinds of loops besides a boot loop. And they could even have a progressive backoff, like the k8s crash loop backoff, so it's not a hot loop.

rdenouden commented 6 days ago

The i915 drivers have a bad history of boot loops and crashes.

You can get into your machine again by adding i915.modeset=0 to the kernel parameters, and it runs fine for now.

rdenouden commented 6 days ago

At the time of writing, I cannot add extensions to 1.8.3 machines. It was the same with the 1.8.2 upgrade for a while. I added i915.modeset=0 as extraKernelArgs, and the Intel NUC nodes with i915 video are now stable.

I have to say that this should be a stern warning not to jump on the latest version until it settles down. In the short time I have been evaluating Talos with Omni, we have already seen such issues with the 1.8 releases. I love Talos, but it's the release process that worries me a bit.

smira commented 4 days ago

We plan to remove the i915 driver from base Talos in 1.9, so that it will use UEFI for the framebuffer (unless you want to add an extension). #9728