siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Install disk errors #9647

Closed askedrelic closed 2 weeks ago

askedrelic commented 2 weeks ago

Bug Report

I'm trying to use machine.install.diskSelector with apply-config against Talos in maintenance mode and get this error:

$ talosctl apply-config -i -e x.x.x.x -n x.x.x.x --file controlplane.yml
error applying new configuration: rpc error: code = InvalidArgument desc = configuration validation failed: 1 error occurred:
        * specified install disk does not exist: ""

Description

I'm trying to install Talos via the API on a bare-metal multi-disk machine, in a two-step process because I don't have physical access to the machine. In the first step, I SSH into a fresh Ubuntu 22.04 instance, write the metal-amd64.iso Talos installer over the existing Ubuntu disk, then reboot into Talos maintenance mode. In the second step, I run apply-config to install Talos with my config.

I've had several failed installs, which I believe are due to inconsistent disk naming, and am trying to use disk IDs or selectors to target the correct bootable disk.

Here is the Talos disks view:

   NODE   NAMESPACE   TYPE   ID        VERSION   SIZE     READ ONLY   TRANSPORT   ROTATIONAL   WWID                                   MODEL                            SERIAL
           runtime     Disk   loop0     1         139 kB   true
           runtime     Disk   loop1     1         4.1 kB   true
           runtime     Disk   loop2     1         6.7 MB   true
           runtime     Disk   loop3     1         170 MB   true
           runtime     Disk   loop4     1         77 MB    true
           runtime     Disk   nvme0n1   1         480 GB   false       nvme                     eui.0050436e03000001                   Dell BOSS-N1                     CN0WW56VFCP0036Q004X
           runtime     Disk   nvme1n1   1         3.2 TB   false       nvme                     eui.36555430575030910025384300000002   Dell Ent NVMe PM1735a MU 3.2TB   S6UTNC0W503091
           runtime     Disk   nvme2n1   1         3.2 TB   false       nvme                     eui.36555430575036220025384300000002   Dell Ent NVMe PM1735a MU 3.2TB   S6UTNC0W503622
           runtime     Disk   nvme3n1   1         3.2 TB   false       nvme                     eui.36555430575030840025384300000002   Dell Ent NVMe PM1735a MU 3.2TB   S6UTNC0W503084
           runtime     Disk   nvme4n1   1         3.2 TB   false       nvme                     eui.36555430575030870025384300000002   Dell Ent NVMe PM1735a MU 3.2TB   S6UTNC0W503087
           runtime     Disk   nvme5n1   1         3.2 TB   false       nvme                     eui.36555430575030860025384300000002   Dell Ent NVMe PM1735a MU 3.2TB   S6UTNC0W503086
           runtime     Disk   nvme6n1   1         3.2 TB   false       nvme                     eui.36555430575027920025384300000002   Dell Ent NVMe PM1735a MU 3.2TB   S6UTNC0W502792

Generally, I'm trying to target the 480 GB "boot" drive, which should be:

machine:
 install:
   disk: /dev/nvme0n1

However, using disk selectors returns the error above. The same happens with wwid, uuid, or other attributes.

machine:
  install:
    diskSelector:
      size: '<= 1TB'
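
For illustration, the other attributes could be written as in the sketch below. It assumes the `model`, `serial`, and `wwid` fields of `machine.install.diskSelector` from the Talos v1alpha1 configuration reference, with values copied from the disks table above; the glob in `model` is likewise an assumption based on that reference. Nothing here is a confirmed fix: according to this report, attribute selectors failed with the same validation error.

```yaml
machine:
  install:
    diskSelector:
      # Assumption: matches the "Dell BOSS-N1" model string via a glob.
      model: 'Dell BOSS*'
      # Alternatives, using values from the disks table (use one at a time):
      # serial: 'CN0WW56VFCP0036Q004X'
      # wwid: 'eui.0050436e03000001'
```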

Is there a better or different way to target this specific disk?

Logs

kernel logs from the apply-config:

{"clock":741534265,"facility":"kern","msg":"block nvme1c1n1: No UUID available providing old NGUID\n SUBSYSTEM=block\n DEVICE=+block:nvme1c1n1\n","priority":"warning","seq":4129,"talos-level":"warn","talos-time":"2024-11-05T18:47:36.699475183Z"}
{"clock":741572347,"facility":"user","msg":" * specified install disk does not exist: \"\"\n","priority":"warning","seq":4131,"talos-level":"warn","talos-time":"2024-11-05T18:47:36.737557183Z"} 

controlplane.yaml, mostly default values, with logging and kernel args added for the first-step maintenance mode boot:

version: v1alpha1 # Indicates the schema used to decode the contents.
debug: false # NOTE: debug logs might cause crashing
persist: true
machine:
  install:
    diskSelector:
      wwid: 'eui.0050436e03000001'
    image: "factory.talos.dev/installer/xxxxxxx:v1.8.1"
    wipe: false 
    extraKernelArgs:
      - talos.logging.kernel=tcp://xxxx
      - talos.platform=metal
      - talos.halt_if_installed=0 # allow re-installs
      - ip=10.15.16.121::10.15.16.126:255.255.255.248:talos:enp27s0f0np0:off:8.8.8.8:8.8.4.4

  logging:
    destinations:
    - endpoint: tcp://xxxx # Where to send logs. Supported protocols are "tcp" and "udp".
      format: json_lines # Logs format.
...

Environment

smira commented 2 weeks ago

The error is misleading, as the "" is certainly a bug.

smira commented 2 weeks ago

I can't reproduce the issue (the disk selector seems to work fine), but the error message is wrong. I'll get it fixed and backported to a stable release.

We will also add another way to match the install disk, and hopefully it will work better.