siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Install and onboarding is a poor and brittle process #9702

Closed smst329 closed 1 week ago

smst329 commented 1 week ago

Bug Report

ISO is stuck in a boot loop when booting from USB. Even before that started it was not a great experience.

Description

Generally the install and onboarding process is a poor experience. The ISO (v1.8.2 from GH releases) assumes machines will be on the happy path automatically.

After I tried to redo an install, the ISO downloaded from GitHub releases got stuck in a bad state where it just kept rebooting in a loop.

It does this even after booting into another distro and wiping the drive that talos installed on. And after reimaging the USB stick with a new copy of the ISO. And doing "Reset talos installation" from the boot menu.

The assumption that the machine will automatically get into a bootable state where it can be configured over the network is a happy path assumption. That may not happen and like other linux installers it would be nice to have opportunities to intervene and attempt to reconfigure and troubleshoot.

The concept of a very minimal K8s based linux distro is interesting, but you should not throw out everything other distros are doing. The install process needs some of the opportunities for user interaction and configuration that other installers provide. It needs visual feedback showing what steps have been performed and what the state of the install is.

Have some prompts for network config.

Show the drive being partitioned and then the install being copied over. Prompt the user to ask if they want to wipe the drive if there is an existing install. Users will benefit from feedback clearly showing that their machine has been put into a state where it is bootable and correctly configured.

The state that the ISO puts my machine in cannot be configured by talosctl. So there needs to be escape hatches that are not based on talosctl.

There is a reason almost every operating system on earth does this, whether it's other Linux distros, macOS, Windows, etc. The install and onboarding could be far saner if the way things work now is one option, but there are also prompts and points of user interaction which can help move the process along if there is an issue that prevents the OS from fully coming up.

I have no idea what's wrong because the machine reboots before I can even read all the logs.

Logs

I wish I could get logs. But the machine reboots before I can.

Environment

Intel i7, 32 GB RAM. Network via Ethernet.

Problem is only talos OS env. When I boot into another linux distro, network works, disks work, etc.

smira commented 1 week ago

The only thing I can think of is that you're booting off USB with Talos already installed, which leads to an error with Talos 1.8. Make sure your boot order starts from disk or eject USB after install.

As you don't provide any data, it's hard to help in any way.

smst329 commented 1 week ago

Like I said, I cannot provide data. Which is why the installer needs to not just assume it can boot and then reboot if it can't boot.

Even if I could provide more info, the reason I should not do that is that this issue actively makes it more difficult for me to provide information or attempt to solve my own problems.

Even if any other suggested features will be ignored, I'd rather the live boot image halt and allow me to read what is on the screen and wait for me to reboot the system. Bonus points if I can page up and down.

Hopefully it shouldn't be too hard to simply stop rebooting forever.

Again, I think problems in this category would be solved much more easily if talos had some of the traditional items that pretty much every OS installer has.

Sure, for boxes that follow the happy path, we can and should use talosctl for headless installs on headless servers. I actually like that. Extend declarative configuration down to the metal! That sounds good to me!

But the installer would be much better with other options when not on the happy path. It should not respond to something unexpected by deciding to reboot in a loop forever. I would prefer not to use those other options most of the time. I want to just talosctl a config and then move on. But sometimes, nothing else will do, and we have to do imperative sysadmin.

xzizka commented 1 week ago

I agree that the first time can be difficult, but I am also not sure what your request is here. Are you connected to a network where DHCP is enabled? If yes, an IP address should be assigned and then you should be able to use talosctl dashboard and see what you see in the machine console. You can also scroll up and down through the messages there. https://www.talos.dev/v1.8/talos-guides/interactive-dashboard/

Do you use a network without DHCP? You can set the IP address via kernel parameters (https://www.talos.dev/v1.8/reference/kernel/). You can set these parameters via https://factory.talos.dev/ and let the tool generate your ISO with your configuration.

You can also deliver the complete configuration this way https://www.talos.dev/v1.8/reference/kernel/#talosconfig
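For the no-DHCP case, a static address is passed with the standard Linux `ip=` kernel argument, whose fields are separated by colons in the order client IP, server IP, gateway, netmask, hostname, device, autoconf. A minimal sketch of assembling that string, assuming IPv4 only; the helper name `build_ip_arg` and all addresses, the hostname and the interface name are made-up example values, not anything from this thread:

```python
import ipaddress


def build_ip_arg(client_ip, gateway, netmask, hostname="", device="",
                 autoconf="off", server_ip=""):
    """Return an ip= kernel argument using the Linux kernel's field order.

    Hypothetical helper for illustration; IPv4 only.
    """
    # Validate the addresses so a typo fails here instead of at boot time.
    for value in (client_ip, gateway, netmask):
        ipaddress.IPv4Address(value)
    fields = [client_ip, server_ip, gateway, netmask, hostname, device, autoconf]
    return "ip=" + ":".join(fields)


# Example values only; substitute the real address plan and interface name.
print(build_ip_arg("192.168.1.50", "192.168.1.1", "255.255.255.0",
                   hostname="talos-node", device="eth0"))
```

The resulting string would then go into the extra kernel arguments field when generating the image.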

This part of the learning process is hard. My own experience. I don't know if you just play with the technology or plan to use it for some production. The first step I recommend is to run Talos in VM(s) with DHCP on (VirtualBox etc.). Learn how to create the cluster and then tune the installation process. For "production use" I recommend tools like Pulumi or Terraform and sort it out this way. Talos is not intended for use with interactive installation wizards.

smst329 commented 1 week ago

First time is not that bad. I did a successful talos install.

The issue is that WHEN the live image encounters a problem, a "strategy" it may employ is permanently rebooting in a loop and rebooting after only a handful of seconds.

The request is not to reboot after a handful of seconds and to stop rebooting forever in a loop when it is a LIVE image.

This actively makes it difficult to gather information to even assess whether the problem is on my end or on talos. I can only see what's on screen for a few seconds. Which is why it actually makes sense to work on this particular bug without information. Because the bug actively impedes the ability to provide information.

smira commented 1 week ago

Talos doesn't have a concept of "Live CD" at all, but it refuses to boot from an ISO if Talos is already installed on disk due to issues which lead to people using wrong boot media.

Wipe your disks, and you can go on with ISO boot once again.

https://www.talos.dev/v1.8/introduction/what-is-new/#taloshalt_if_installed-kernel-argument

smst329 commented 1 week ago

If you don't want to call it "live" that's OK. But it was booted from USB. And what if talos is installed and I want to install again from the boot media, which happened to be a USB?

I don't see how rebooting forever in a loop is a good experience.

smira commented 1 week ago

As I stated above, you need to wipe the disk.

I posted the reason for the "boot loop" above as well, certainly it might not be the best behavior from your point of view today, but it does save other users from a lot of trouble later.

smst329 commented 1 week ago

I did wipe the disk. Are you sure that is the only thing that can cause a loop?

Will messing with the kernel argument above fix all possible sources of a reboot loop? Seems only related to prior install.

And if there is a boot loop, how do you want me to verify and prove whether or not there is an issue, when it's rebooting permanently?

smira commented 1 week ago

I can't guess what exactly the issue is.

One can use serial kernel logs (which need to be enabled) to capture all logs, use IPMI or other BMC to capture them and keep the history.
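Whichever channel delivers the console output, the point is to keep a history that survives each reboot. A minimal sketch of that idea, assuming the console text already arrives as a line stream (for example on standard input from whatever serial or BMC tool is in use); the function name `capture` and the file name `console.log` are hypothetical:

```python
import io
import sys
from datetime import datetime, timezone


def capture(stream, sink, now=None):
    """Copy lines from stream to sink with a UTC timestamp prefix.

    Returns the number of lines written. Each line is flushed immediately
    so nothing is lost if the machine reboots mid-capture.
    """
    now = now or (lambda: datetime.now(timezone.utc))
    count = 0
    for line in stream:
        text = line.rstrip("\r\n")
        sink.write(f"{now().isoformat()} {text}\n")
        sink.flush()
        count += 1
    return count


# Usage sketch (not executed here): append everything read from stdin to a file.
#   with open("console.log", "a", encoding="utf-8") as log:
#       capture(sys.stdin, log)
```

Appending rather than overwriting means successive boot attempts end up in one file, so the last lines before each reboot can be compared.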

You can use Omni to debug the issue, as Talos would stream logs to Omni.

You can use your phone to capture the video and look into frames later on.

Talos should by default pause for 10 seconds before rebooting.

smst329 commented 1 week ago

I never asked for Talos to determine what the issue is. I'm telling you that infinite boot loops were a bad experience when doing an install. You can do with that what you will.