siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0
6.84k stars 550 forks source link

Raspberry Pi CM4 doesn't reboot cleanly -- hangs after GRUB. #6273

Closed mmdevworkshop closed 2 years ago

mmdevworkshop commented 2 years ago

Bug Report

Raspberry Pi CM4 doesn't reboot cleanly -- hangs after GRUB.

Description

I've installed 6x CM4 units in a https://github.com/DeskPi-Team/super6c mini-itx cluster board for the purpose of experimenting with Kubernetes in a home lab environment. Talos seems like a good fit. I managed to flail my way through the documentation, but after flashing and reflashing things many times, I have noticed that the individual modules won't reboot cleanly. Troubleshooting is made trickier by the fact that only the first of the 6 modules has a console output. While that was the one I made the control plane, talosctl dmesg and experimentation suggests that control plane vs worker is irrelevant here.

I have tried both talosctl reboot and talosctl reboot -m powercycle, but it didn't make a difference. I found https://github.com/pftf/RPi4/issues/168 and attempted to add the kernel parameter efi=no_disable_early_pci_dma (got hung up for a while figuring out how to make it add the parameter -- lurkers: you need to go through a talosctl upgrade), but that didn't affect anything. I also tried setting reboot=k (which seems to be referenced in various examples , so I thought maybe it was common for some reason), but that didn't help either. I might have enough patience to try going through various reboot= parameters tomorrow, but I'm not holding out much hope.

I did install regular old raspbian on one of the modules, and sudo reboot works fine; it comes back up and is sshable shortly. I previously had done one of the modules which now is running Talos, so it does not seem related to the specific compute module or the socket/location. I inspected the kernel parameters used by the default Raspbian install, but nothing stands out; these are the parameters it has set: console=serial0,115200 console=tty1 root=PARTUUID=8dae39b5-02 rootfstype=ext4 fsck.repair=yes rootwait.

I'm kind of out of ideas from here. Unfortunately this is a deal breaker for playing with Talos, since this board cannot power cycle individual modules: it's not possible for me to do a rolling upgrade (every node I reboot is dead until I turn everything off), which kind of defeats the whole point.

I do have some time on my hands and am willing to help run down what's happening if someone can guide me, though. I don't know much about the subtle mechanics of rebooting (I didn't know it was so complicated until today's googling!)

Logs

The only module I can directly observe is the first module, but the others appear to behave in the same fashion as far as I can observe. The module cold boots fine, but upon issuing talosctl reboot, the following happens:

EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

Nothing else happens from here; I've left it sitting for at least 30 minutes.

Here's what prints from dmesg -f up until the connection is closed:

192.168.2.41: user: warning: [2022-09-15T05:25:08.79701133Z]: [talos] reboot via API received. actor id: 9312d28e-e7e3-40ed-be0c-a980824a3c47
192.168.2.41: user: warning: [2022-09-15T05:25:08.80939233Z]: [talos] reboot sequence: 12 phase(s)
192.168.2.41: user: warning: [2022-09-15T05:25:08.81597533Z]: [talos] phase cleanup (1/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:08.82304733Z]: [talos] task stopAllPods (1/1): starting
192.168.2.41: user: warning: [2022-09-15T05:25:08.82983133Z]: [talos] task stopAllPods (1/1): waiting for kubelet lifecycle finalizers
192.168.2.41: user: warning: [2022-09-15T05:25:08.84130533Z]: [talos] task stopAllPods (1/1): shutting down kubelet gracefully
192.168.2.41: user: warning: [2022-09-15T05:25:29.14916533Z]: [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 3496, container kubelet)
192.168.2.41: user: warning: [2022-09-15T05:25:29.44414733Z]: [talos] service[kubelet](Finished): Service finished successfully
192.168.2.41: user: warning: [2022-09-15T05:25:29.48258533Z]: [talos] skipping pod kube-system/kube-proxy-rbw9k, state SANDBOX_NOTREADY
192.168.2.41: user: warning: [2022-09-15T05:25:29.49392533Z]: [talos] skipping pod kube-system/kube-flannel-7hzql, state SANDBOX_NOTREADY
192.168.2.41: user: warning: [2022-09-15T05:25:29.50561133Z]: [talos] skipping pod kube-system/kube-scheduler-cm41, state SANDBOX_NOTREADY
192.168.2.41: user: warning: [2022-09-15T05:25:29.51783233Z]: [talos] skipping pod kube-system/kube-proxy-rbw9k, state SANDBOX_NOTREADY
192.168.2.41: user: warning: [2022-09-15T05:25:29.52979733Z]: [talos] skipping pod kube-system/kube-apiserver-cm41, state SANDBOX_NOTREADY
192.168.2.41: user: warning: [2022-09-15T05:25:29.54113033Z]: [talos] skipping pod kube-system/kube-controller-manager-cm41, state SANDBOX_NOTREADY
192.168.2.41: user: warning: [2022-09-15T05:25:29.55300733Z]: [talos] task stopAllPods (1/1): done, 20.730038768s
192.168.2.41: user: warning: [2022-09-15T05:25:29.56046833Z]: [talos] phase cleanup (1/12): done, 20.744424969s
192.168.2.41: user: warning: [2022-09-15T05:25:29.56746933Z]: [talos] phase dbus (2/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:29.57347433Z]: [talos] task stopDBus (1/1): starting
192.168.2.41: user: warning: [2022-09-15T05:25:29.57954533Z]: [talos] task stopDBus (1/1): done, 6.06979ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.58593733Z]: [talos] phase dbus (2/12): done, 18.477852ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.59216833Z]: [talos] phase stopServices (3/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:29.59855533Z]: [talos] task stopServicesForUpgrade (1/1): starting
192.168.2.41: user: warning: [2022-09-15T05:25:29.60556833Z]: [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 3495, container etcd)
192.168.2.41: user: warning: [2022-09-15T05:25:29.61645933Z]: [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 3412, container trustd)
192.168.2.41: user: warning: [2022-09-15T05:25:29.62931033Z]: [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
192.168.2.41: user: warning: [2022-09-15T05:25:29.65042633Z]: [talos] service[udevd](Finished): Service finished successfully
192.168.2.41: user: warning: [2022-09-15T05:25:29.76620333Z]: [talos] service[trustd](Finished): Service finished successfully
192.168.2.41: user: warning: [2022-09-15T05:25:29.78306433Z]: [talos] service[etcd](Finished): Service finished successfully
192.168.2.41: user: warning: [2022-09-15T05:25:29.79284033Z]: [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
192.168.2.41: user: warning: [2022-09-15T05:25:29.84200633Z]: [talos] service[cri](Finished): Service finished successfully
192.168.2.41: user: warning: [2022-09-15T05:25:29.85181933Z]: [talos] task stopServicesForUpgrade (1/1): done, 253.255714ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.86293733Z]: [talos] phase stopServices (3/12): done, 270.750959ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.87126033Z]: [talos] phase unmountUser (4/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:29.87906933Z]: [talos] task unmountUserDisks (1/1): starting
192.168.2.41: kern:  notice: [2022-09-15T05:25:29.88764233Z]: XFS (nvme0n1p1): Unmounting Filesystem
192.168.2.41: user: warning: [2022-09-15T05:25:29.91078733Z]: [talos] task unmountUserDisks (1/1): done, 31.708078ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.91925433Z]: [talos] phase unmountUser (4/12): done, 47.998846ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.92732933Z]: [talos] phase umount (5/12): 2 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:29.93442133Z]: [talos] task unmountPodMounts (2/2): starting
192.168.2.41: user: warning: [2022-09-15T05:25:29.94162233Z]: [talos] task unmountOverlayFilesystems (1/2): starting
192.168.2.41: user: warning: [2022-09-15T05:25:29.95053733Z]: [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/cfad13b0-d4e8-4e0a-b8d7-9f6a0ae57fd6/volumes/kubernetes.io~projected/kube-api-access-6qjrw
192.168.2.41: user: warning: [2022-09-15T05:25:29.97093433Z]: [talos] task unmountPodMounts (2/2): done, 36.513151ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.98121133Z]: [talos] task unmountOverlayFilesystems (1/2): done, 46.727351ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.99228333Z]: [talos] phase umount (5/12): done, 64.940186ms
192.168.2.41: user: warning: [2022-09-15T05:25:29.99974933Z]: [talos] phase unmountBind (6/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:30.00711933Z]: [talos] task unmountSystemDiskBindMounts (1/1): starting
192.168.2.41: user: warning: [2022-09-15T05:25:30.01553233Z]: [talos] task unmountSystemDiskBindMounts (1/1): unmounting /system/state
192.168.2.41: kern:  notice: [2022-09-15T05:25:30.02648033Z]: XFS (mmcblk0p5): Unmounting Filesystem
192.168.2.41: user: warning: [2022-09-15T05:25:30.03625633Z]: [talos] task unmountSystemDiskBindMounts (1/1): unmounting /var
192.168.2.41: kern:  notice: [2022-09-15T05:25:30.20664633Z]: XFS (mmcblk0p6): Unmounting Filesystem
192.168.2.41: user: warning: [2022-09-15T05:25:30.23774033Z]: [talos] task unmountSystemDiskBindMounts (1/1): done, 230.619601ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.24776533Z]: [talos] phase unmountBind (6/12): done, 248.018495ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.25506533Z]: [talos] phase unmountSystem (7/12): 2 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:30.26515633Z]: [talos] task unmountStatePartition (2/2): starting
192.168.2.41: user: warning: [2022-09-15T05:25:30.27241633Z]: [talos] task unmountEphemeralPartition (1/2): starting
192.168.2.41: user: warning: [2022-09-15T05:25:30.28004533Z]: [talos] task unmountEphemeralPartition (1/2): done, 14.855459ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.28887433Z]: [talos] task unmountStatePartition (2/2): done, 14.981533ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.29647733Z]: [talos] phase unmountSystem (7/12): done, 41.419688ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.30350933Z]: [talos] phase mountBoot (8/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:30.30960733Z]: [talos] task mountBootPartition (1/1): starting
192.168.2.41: kern:  notice: [2022-09-15T05:25:30.32126133Z]: XFS (mmcblk0p3): Mounting V5 Filesystem
192.168.2.41: kern:    info: [2022-09-15T05:25:30.43072533Z]: XFS (mmcblk0p3): Ending clean mount
192.168.2.41: user: warning: [2022-09-15T05:25:30.43900733Z]: [talos] task mountBootPartition (1/1): done, 129.402745ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.44731833Z]: [talos] phase mountBoot (8/12): done, 143.80704ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.45484533Z]: [talos] phase kexec (9/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:30.46160633Z]: [talos] task kexecPrepare (1/1): starting
192.168.2.41: user: warning: [2022-09-15T05:25:30.47137333Z]: [talos] kexec support is disabled via sysctl
192.168.2.41: user: warning: [2022-09-15T05:25:30.47854233Z]: [talos] task kexecPrepare (1/1): done, 16.946506ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.48617833Z]: [talos] phase kexec (9/12): done, 31.339691ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.49305533Z]: [talos] phase unmountBoot (10/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:30.49996833Z]: [talos] task unmountBootPartition (1/1): starting
192.168.2.41: kern:  notice: [2022-09-15T05:25:30.50809933Z]: XFS (mmcblk0p3): Unmounting Filesystem
192.168.2.41: user: warning: [2022-09-15T05:25:30.51725133Z]: [talos] task unmountBootPartition (1/1): done, 17.274228ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.52504033Z]: [talos] phase unmountBoot (10/12): done, 32.002873ms
192.168.2.41: user: warning: [2022-09-15T05:25:30.53231433Z]: [talos] phase stopEverything (11/12): 1 tasks(s)
192.168.2.41: user: warning: [2022-09-15T05:25:30.53928833Z]: [talos] task stopAllServices (1/1): starting
1 error occurred:
 rpc error: code = Unknown desc = close /dev/kmsg: file already closed

Environment

frezbo commented 2 years ago

cm4 is not a supported talos platform yet. so I'm not sure which docs you've followed

frezbo commented 2 years ago

also without console logs, there's nothing much to provide help over, we don't have compute modules at hand

myndzi commented 2 years ago

(I am the OP, was just on a different chrome profile when I posted and realized it too late)

The RPi 4 instructions worked out of the box, though I went back and crafted yamls until I had one that worked with the configuration options I wanted. I've given you everything present on the screen before and after, do you mean something else by "console logs"?

If you tell me what info you want/need (and maybe point me in a direction of how to get it), I should be able to do that :)

frezbo commented 2 years ago

do you mean something else by "console logs"?

yes, not the one from talosctl dmesg since that stops once apid stops. Something from a serial console would be perfect.

frezbo commented 2 years ago

The RPi 4 instructions worked out of the box

the u-boot shipped for rpi4 talos image is not even the right one for cm4, so it's possible it would have issues

myndzi commented 2 years ago

Yeah, I didn't expect dmesg to be that helpful, but I thought it might give some context about the shutdown process that might stand out to someone on the project.

I'll take a video of the screen at reboot and see if anything stands out, then see what it would take to get serial output. I likely need to order at least one thing from Amazon in that case unfortunately.

I'm also comfortable with Go, Docker, etc. and could follow the development/deploy steps, make code change, and do other misc. stuff to gather info -- I'm only missing the context about what should be happening and what to look for / explore. I expect part of this might overlap with the rpi bootloader, and I'm hoping that since Talos seems at least tested against some rpis that perhaps someone on the project has the context about what might matter in that realm too.

I'll report back after I've either found an interesting error or got a full serial log from both boot and reboot.

(Edit: video shows that there are some warnings at the u-boot stage, but the text on screen for a successful boot appears to be the same as the text on screen for an unsuccessful boot. Will pursue the serial console thing)

frezbo commented 2 years ago

Yeah, I didn't expect dmesg to be that helpful, but I thought it might give some context about the shutdown process that might stand out to someone on the project.

it's all linux kernel at that point, nothing talos specific

but one thing that might be problematic is that the arm trusted firmware (ATF) in the rpi4 talos image is specific to rpi4 and it's something that manages PSCI, which managed the power state, so one assumption is that on CM4 it's not using the same chip or the rpi4 one is not compatible. Some reference to it is here in this PR: https://github.com/siderolabs/pkgs/pull/375

This can be easily confirmed by mounting the install medium (usb/sd card) to a machine, mounting the first partition and editing the config.txt and removing the armstub=bl31.bin and see if that fixes the power issue.

myndzi commented 2 years ago

This can be easily confirmed by mounting the install medium (usb/sd card) to a machine, mounting the first partition and editing the config.txt and removing the armstub=bl31.bin and see if that fixes the power issue.

That is well within my capabilities to test :) I have a little test IO board coming tomorrow or the next day, plus a serial<->usb adapter, so I'll pull one of the CM4s and see if I can repro on a different board, see if there are any revealing logs, and see if altering the armstub line changes the behavior. I was running into bl31.bin in my searches but don't fully understand the details of what it does and when it's necessary, but I was generally suspicious of the difference being in config.txt anyway -- just wasn't sure how / where to find it (I flashed so many images, but Windows was just spamming me with unformatted disk dialogs, so I may have missed a mountable one -- or I may need to mount it from another o/s).

(I was trying to see which options Talos even used, and even deployed a pod with /boot mounted but it was empty, so thought maybe Talos was doing something a little different in the context of raspberry pis. I later found the config file in this repo, so I have some idea what options are being set, but your advice hopefully gives me a direct line. It's made a bit more annoying since I only have EMMC modules, no "lite" ones)

I also found some references to similar restart problems with the CM4 (much older github issues) relating to certain power management stuff, one implicating the power supply and one implicating the i/o board (which, it turns out, might be the board I actually bought...), and one implicating indirectly the gpu_mem line (related also to power stuff).

At any rate, I have a queue of things I can poke at without needing to understand the entire boot sequence in the context of Talos -- but I'm willing to dive deep with some guidance :)

Update (edited in because nothing has really changed):

Update 2:

myndzi commented 2 years ago

OK, I have an update with some relevant information:

I'm not super sure of the power brick I hooked up to the CM4IO board; I thought it would take USB C, which I had, but instead had to scrounge around for something in my miscellaneous bin. The one I found is 12 volts at 2000 mA, and with a kill-a-watt hooked up while powering it on it never went past 3.1 watts or 0.05 amps of current. It could be possible that the boot failure on the CM4IO board is power-related, but seems unlikely since I can "kinda" reproduce on the Super6C (which mysteriously manages to boot from cold, at least). The Super6C with all 6 CM4s and all 6 NVMe drives didn't go past 20 watts, and the power brick is rated for 100 watts, making power problems seem less likely (but I'm still confused about why the boot logs contain an undervolt warning...). TLDR: I think the power is fine and probably not the problem here, though I've seen lots of power-related things implicated in similar problems in my searching.

Finally, here's the full cold boot log from Talos on the CM4IO with all the debug options on:

config.txt:

...
enable_uart=1
start_debug=1
dtoverlay=disable-bt,disable-wifi
uart_2ndstage=1

cold boot:

Read start4db.elf bytes  3739240 hnd 0x11061
Read fixup4db.dat bytes     8356 hnd 0x65c
Firmware: e5a963efa66a1974127860b42e913d2374139ff5 Mar 24 2022 13:19:58
0x00d03141 0x00000000 0x00000fff
MEM GPU: 32 ARM: 992 TOTAL: 1024
Starting start4db.elf @ 0xfec00200 partition 0
+

ASRT:00:00:03.818965:0: ../../../../../vcfw/drivers/chip/vciv/2708/clock_2711.c::clock_wait_busy!2151: timeout != 0
MESS:00:00:04.224240:0: arasan: arasan_emmc_open
MESS:00:00:04.744859:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:04.747811:0: brfs: File read: 603 bytes
MESS:00:00:04.870077:0: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:04.879593:0: HDMI1:EDID error reading EDID block 0 attempt 1
MESS:00:00:04.889100:0: HDMI1:EDID error reading EDID block 0 attempt 2
MESS:00:00:04.898614:0: HDMI1:EDID error reading EDID block 0 attempt 3
MESS:00:00:04.908124:0: HDMI1:EDID error reading EDID block 0 attempt 4
MESS:00:00:04.917638:0: HDMI1:EDID error reading EDID block 0 attempt 5
MESS:00:00:04.927145:0: HDMI1:EDID error reading EDID block 0 attempt 6
MESS:00:00:04.936660:0: HDMI1:EDID error reading EDID block 0 attempt 7
MESS:00:00:04.946170:0: HDMI1:EDID error reading EDID block 0 attempt 8
MESS:00:00:04.955684:0: HDMI1:EDID error reading EDID block 0 attempt 9
MESS:00:00:04.960183:0: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:04.966605:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:04.971005:0: gpioman: gpioman_get_pin_num: pin DISPLAY_SDA not defined
MESS:00:00:04.977763:0: gpioman: gpioman_get_pin_num: pin LEDS_PWR_OK not defined
MESS:00:00:05.006149:0: gpioman: gpioman_get_pin_num: pin FLASH_0_ENABLE not defined
MESS:00:00:05.010778:0: gpioman: gpioman_get_pin_num: pin FLASH_0_INDICATOR not defined
MESS:00:00:05.018519:0: gpioman: gpioman_get_pin_num: pin FLASH_0_ENABLE not defined
MESS:00:00:05.025968:0: gpioman: gpioman_get_pin_num: pin FLASH_0_INDICATOR not defined
ASRT:00:00:05.438291:0: ../../../../../vcfw/drivers/chip/vciv/2708/power.c::power_set_voltage!3436: uV >= platform_get_min_core_voltage() * 1000 && uV <= platform_get_max_core_voltage() * 1000 + (BCM2836 ? 12500:0)
ASRT:00:00:05.455298:0: ../../../../../vcfw/drivers/chip/vciv/2708/power.c::power_set_voltage!3436: uV >= platform_get_min_core_voltage() * 1000 && uV <= platform_get_max_core_voltage() * 1000 + (BCM2836 ? 12500:0)
ASRT:00:00:05.473335:0: ../../../../../vcfw/drivers/chip/vciv/2708/power.c::power_set_voltage!3436: uV >= platform_get_min_core_voltage() * 1000 && uV <= platform_get_max_core_voltage() * 1000 + (BCM2836 ? 12500:0)
MESS:00:00:05.536352:0: gpioman: gpioman_get_pin_num: pin LEDS_PWR_OK not defined
MESS:00:00:05.542107:0: *** Restart logging
MESS:00:00:05.544634:0: brfs: File read: 603 bytes
MESS:00:00:05.613205:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:05.623238:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 1
MESS:00:00:05.640207:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 2
MESS:00:00:05.650299:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 3
MESS:00:00:05.660396:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 4
MESS:00:00:05.670482:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 5
MESS:00:00:05.680577:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 6
MESS:00:00:05.690668:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 7
MESS:00:00:05.700764:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 8
MESS:00:00:05.710850:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 9
MESS:00:00:05.715928:0: hdmi: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:05.726587:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:05.736674:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 1
MESS:00:00:05.746770:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 2
MESS:00:00:05.756859:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 3
MESS:00:00:05.766955:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 4
MESS:00:00:05.777042:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 5
MESS:00:00:05.787138:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 6
MESS:00:00:05.797227:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 7
MESS:00:00:05.807323:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 8
MESS:00:00:05.817410:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 9
MESS:00:00:05.822487:0: hdmi: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:05.835941:0: HDMI0: hdmi_pixel_encoding: 300000000
MESS:00:00:05.838570:0: HDMI1: hdmi_pixel_encoding: 300000000
MESS:00:00:05.844929:0: gpioman: gpioman_get_pin_num: pin CAMERA_0_I2C_PORT not defined
MESS:00:00:05.856041:0: dtb_file 'bcm2711-rpi-cm4.dtb'
MESS:00:00:05.863429:0: brfs: File read: /mfs/sd/bcm2711-rpi-cm4.dtb
MESS:00:00:05.866678:0: Loading 'bcm2711-rpi-cm4.dtb' to 0x100 size 0xcbcc
MESS:00:00:05.886444:0: brfs: File read: 52172 bytes
MESS:00:00:05.895949:0: brfs: File read: /mfs/sd/overlays/overlay_map.dtb
MESS:00:00:05.972808:0: brfs: File read: 1783 bytes
MESS:00:00:05.976617:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:05.980466:0: brfs: File read: 603 bytes
MESS:00:00:05.986762:0: brfs: File read: /mfs/sd/overlays/disable-bt.dtbo
MESS:00:00:06.005797:0: Loaded overlay 'disable-bt'
MESS:00:00:06.007563:0: dtparam: disable-wifi=true
MESS:00:00:06.015131:0: Unknown dtparam 'disable-wifi' - ignored
MESS:00:00:06.119528:0: brfs: File read: 1073 bytes
MESS:00:00:06.123788:0: Failed to open command line file 'cmdline.txt'
MESS:00:00:06.383260:0: brfs: File read: /mfs/sd/bl31.bin
MESS:00:00:06.385596:0: Loading 'bl31.bin' to 0x0 size 0xb06b
MESS:00:00:06.391099:0: brfs: File read: 45163 bytes
MESS:00:00:06.515991:0: brfs: File read: /mfs/sd/u-boot.bin
MESS:00:00:06.518508:0: Loading 'u-boot.bin' to 0x80000 size 0x928f8
MESS:00:00:06.524600:0: Device tree loaded to 0x2eff2e00 (size 0xd1b7)
MESS:00:00:06.534331:0: uart: Set PL011 baud rate to 103448.300000 Hz
MESS:00:00:06.540677:0: uart: Baud rate change done...
MESS:00:00:06.542733:0: uart: Baud rate change done...
MESS:00:00:06.548264:0: gpioman: gpioman_get_pin_num: pin SDCARD_CONTROL_POWER not defined
NOTICE:  BL31: v2.6(release):
NOTICE:  BL31: Built : 18:08:32, Aug 12 2022

U-Boot 2022.04 (Aug 12 2022 - 18:08:35 +0000)

DRAM:  7.9 GiB
RPI Compute Module 4 (0xd03141)
Core:  201 devices, 13 uclasses, devicetree: board
MMC:   mmcnr@7e300000: 1, mmc@7e340000: 0
Loading Environment from FAT... Unable to read "uboot.env" from mmc0:1...
In:    serial
Out:   vidconsole
Err:   vidconsole
Net:   eth0: ethernet@7d580000
PCIe BRCM: link up, 5.0 Gbps x1 (SSC)
starting USB...
No working controllers found
Hit any key to stop autoboot:  0
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
Card did not respond to voltage select! : -110
Scanning disk mmcnr@7e300000.blk...
Disk mmcnr@7e300000.blk not ready
Scanning disk mmc@7e340000.blk...
Found 7 disks
** Unable to read file ubootefi.var **
Failed to load EFI variables
BootOrder not defined
EFI boot manager: Cannot load any image
Found EFI removable media binary efi/boot/bootaa64.efi
147456 bytes read in 20 ms (7 MiB/s)
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
Booting /efi\boot\bootaa64.efi
Welcome to GRUB!

                             GNU GRUB  version 2.06

 ┌────────────────────────────────────────────────────────────────────────────┐
 │*A - Talos v1.2.0                                                           │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 └────────────────────────────────────────────────────────────────────────────┘

      Use the ▲ and ▼ keys to select which entry is highlighted.
      Press enter to boot the selected OS, `e' to edit the commands
      before booting or `c' for a command-line.
   The highlighted entry will be executed automatically in 0s.
  Booting `A - Talos v1.2.0'

EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

(and here it hangs indefinitely)

I also have logs at this level of verbosity for:

... and all of these for the same module, in the same CM4IO board, with "raspberry pi os 64 bit lite" installed, for which everything succeeded. I did not have the NVMe adapter when I got those logs, so now I'm off to verify whether it behaves "normally" with NVME hooked up. (Edit: it does. boots, reboots, and shuts down fine)

If you want any of the above-mentioned logs, or have any tests to suggest, let me know :)

As far as I can tell, Talos seems to otherwise run fine on the CM4 just like it would on a Rpi4, but the pcie / nvme stuff may be where the divergence lies.

myndzi commented 2 years ago

By the way, I didn't explicitly say it yet I think, but I've flashed the latest Rpi bootloader before doing anything on each of these CM4s

Edit, unrelated: Any chance I can just merge the contents of /boot from a Raspbian install over to /EFI/boot on a Talos install and just change config.txt? I'm wondering if maybe Talos's image maybe just has some out-of-date stuff on it?

frezbo commented 2 years ago

I'm not super sure of the power brick I hooked up to the CM4IO board; I thought it would take USB C, which I had, but instead had to scrounge around for something in my miscellaneous bin. The one I found is 12 volts at 2000 mA, and with a kill-a-watt hooked up while powering it on it never went past 3.1 watts or 0.05 amps of current. It could be possible that the boot failure on the CM4IO board is power-related, but seems unlikely since I can "kinda" reproduce on the Super6C (which mysteriously manages to boot from cold, at least). The Super6C with all 6 CM4s and all 6 NVMe drives didn't go past 20 watts, and the power brick is rated for 100 watts, making power problems seem less likely (but I'm still confused about why the boot logs contain an undervolt warning...). TLDR: I think the power is fine and probably not the problem here, though I've seen lots of power-related things implicated in similar problems in my searching.

so regarding the power, it might be that even though the power brick can provide enough power, the VRM (voltage regulator module) for each nvme drive may not provide enough power. (This is an assumtption and would highly depend on the hardware design).

Edit, unrelated: Any chance I can just merge the contents of /boot from a Raspbian install over to /EFI/boot on a Talos install and just change config.txt? I'm wondering if maybe Talos's image maybe just has some out-of-date stuff on it?

so the problem here is that it may or may not work, since it's using combinations that are not explicitly designed for cm4. Eg u-boot is for rpi4 specific, armstub is rpi4 specific

frezbo commented 2 years ago

also as mentioned earlier with SBC's and CM4 having so many variant support boards, there's nothing much help we can do on our side to test unless we have the exact hardware to test.

frezbo commented 2 years ago

it might be worth testing with u-boot compiled for cm4 specific and replacing the u-boot.bin with the cm4 specific one.

frezbo commented 2 years ago

Attaching u-boot.bin as a gzipped tar archive, built with rpi_arm64_defconfig as mentioned here: https://github.com/u-boot/u-boot/blob/v2022.07/doc/board/broadcom/raspberrypi.rst#64-bit

u-boot.bin.tar.gz

myndzi commented 2 years ago

I can give that a shot.

also as mentioned earlier with SBC's and CM4 having so many variant support boards, there's nothing much help we can do on our side to test unless we have the exact hardware to test.

I can do build and test work with some guidance on what to do/expect/investigate too, but I'll start here :) (The main thing is, I don't understand much about how the boot process works in the level of detail I would need; I don't know which parts are replaceable and which need to work together or how)

myndzi commented 2 years ago

No luck with the alternate u-boot on the CM4IO board (with or without armstub disabled). I do have a different brand of M.2 stick on the way, too, just in case it changes anything

myndzi commented 2 years ago

Update: Continued to test varying arrangements; I got a successful reboot with the u-boot linked above, on the Super6C, with NVME attached and a beefier power supply. The fact that it booted even once suggests that any remaining problem is not likely to be "bits on disk" in any combination, though there's always the possibility that the hardware environment is creating a problem that some software can always deal with and other software gets caught out by.

Next up:

Annoyingly, there's still the brief "undervolt" warning in the boot logs, presumably from everything starting at the same time. I guess that's on the board's power controllers.

(is armstub Pi 4 specific? It didn't seem to make a difference whether I loaded it or not earlier)

frezbo commented 2 years ago

is armstub Pi 4 specific

yes, i believe so, I don't see one specific for cm4 https://github.com/ARM-software/arm-trusted-firmware/blob/master/docs/plat/rpi4.rst

myndzi commented 2 years ago

The u-boot.bin you supplied seems to get the job done. I don't know why it doesn't work on the CM4IO board, but that's not the one I care about. Original power brick, all drives, and all CM4s reboots correctly with that file.

Thanks for your help, I believe this will be sufficient for my use; I'll work my way through the Talos build process so that I can have a repeatable sequence of steps for building updates too.

It may be suitable for a different issue/thread/channel of communication, but if you'd like for me to perform any specific test/verification steps beyond the things I happen to currently know about, for the purpose of publishing an "official" build that works for (at least one) CM4 setup, I'm happy to do whatever you need me to with the hardware I have; though, I'd completely understand if it makes more sense to wait until an actual contributor to the project has physical hardware they can develop on/with too.

frezbo commented 2 years ago

you'd start by adding a u-boot target to https://github.com/siderolabs/pkgs/blob/main/u-boot/pkg.yaml and then something similar as https://github.com/siderolabs/talos/pull/6111

myndzi commented 2 years ago

I believe I've successfully made the changes to both repositories, but I'm not sure how to integrate them in local testing...

In pkgs I was able to run REGISTRY=localhost:5005 PLATFORM=linux/arm64 make docker-u-boot, which succeeds; for expediency I'm just changing the Dockerfile line in the talos repository that fetches u-boot to point to my local registry. However, even after building and publishing imagerfrom the talos repository, (I thought maybe the error I'm seeing came from an old version which doesn't have the new item in the "constants" package, and that maybe rebuilding it would update that) I can't seem to get it to recognize the new board type: failed to create installation manifest: unsupported board: "rpi_cm4" (command: REGISTRY_AND_USERNAME=127.0.0.1:5005/siderolabs IMAGE_TAG=v1.2.3 PLATFORM=linux/arm64 make sbc-rpi_cm4)

I'm sure I'm missing something a little silly, or maybe I just have to build everything (I can wait for that, but I don't want to wait and flail repeatedly) -- any pointers?

(I'm trying to create a metal xz image that I can flash to verify that my changes have the expected effect)

frezbo commented 2 years ago

so on u-boot side run make u-boot REGISTRY=localhost:5005 PLATFORM=linux/arm64 PUSH=true that would push it to the local registry, then on talos side edit the dockerfile and change the pkgs reference for u-boot to the locally published image. Then once you've added a new rpi_cm4 target, run a make installer REGISTRY=localhost:5005 PUSH=true

then after that make sbc-rpi_cm4

failed to create installation manifest: unsupported board: "rpi_cm4"

that probably means you missed to register it as shown here: https://github.com/siderolabs/talos/pull/6111/files#diff-35c1b024cdea1e459b51cadb40f31e777db09c8bff73a5454f304a472974e98cR69

myndzi commented 2 years ago

I did add it there, but everything else you mentioned is what I did. I'll double-check the code, I wasn't entirely sure if "installer" was the image that is actually running the code that checks; I may have typoed something somewhere. My best guess is I'm running the wrong image, which is why I was trying to set the registry/tag arguments. Thanks!

Edit: found my way there. Had to set IMAGE_TAG=v1.2.3-1-g9a281239d along with REGISTRY on the make sbc command: make sbc-rpi_4 REGISTRY=localhost:5005 IMAGE_TAG=v1.2.3-1-g9a281239d. Thought I tried this yesterday, but must have got some other steps wrong. New error, but at least I know the unknowns now :)

Edit 2: Full steps (from a macbook m1) for posterity:

Now to make sure that it actually boots on this image :)

myndzi commented 2 years ago

The image I built flashes and boots correctly, but doesn't reboot :(

Either I compiled u-boot differently or I didn't use the correct u-boot bin when compiling the talos image. I gave it a different filename specifically to ensure everything lined up though, so my guess is the former. There's nothing much to do or change, though; I just made it with the arm64 default config: https://github.com/siderolabs/pkgs/compare/main...myndzi:pkgs:myndzi/cm4-super6c#diff-4fa7754436569ad36bda7353cbcde28216245cf1cd5873d46800e5e2023b5fd2R90

If this is how you built the zip you provided me, do you see anything different there?

frezbo commented 2 years ago

that looks correct, built the u-boot as same. I think you should remove the armstub though, since it's rpi4 specific

myndzi commented 2 years ago

I thought I had -- will double-check (edit: it is not present, as expected). Will also just replace u-boot.bin with the zip you provided to see if it "works" again. Won't be able to poke at things for a while yet, thanks for confirming though.

Edit: it seems to possibly be copying the wrong file, not sure why yet. Building a docker image that copies the u-boot tree from the pkgs-published image shows that the cm4 version is present, and the Install receiver for the RPiCM4 struct does reference the correct path, and the config.txt file reflects the one I modified/added for the cm4. But the file on disk on the CM4 after flashing the sbc xz is larger than it should be, same size as the rpi_4 one. Also, for whatever reason, after a soft reboot (not a cold boot; it happened after the initial flash->boot and also after manual reboot commands), the API is unresponsive with an error code = Unavailable desc = connection closed before server preface received - don't know what the deal is there yet. Cold booting makes it responsive to the API again.

myndzi commented 2 years ago

OK, so I have a repeatable thing that works -- mostly. I had been working from the latest stable release branch of both repositories, but the filesize of the u-boot.bin I was generating was different from yours. I tried the full sequence listed above, but from the latest main branch of both repositories and it works. I'm not sure whether I care to work out whether the difference is the u-boot version (the latest main includes a version bump from pkgs@1.2.0), or the build configuration.

I'm getting the "connection closed before server preface received" problem still, sometimes (e.g. after a reboot, but not on a cold boot). Any ideas on what could be the cause of that, or how to debug further? (is there some verbose/debug argument to talosctl I can use to see details?)

When this happens, the console displays "boot sequence: done" and all the tasks have completed, and there don't seem to be any recurring errors like there can be when some services haven't started up yet. I'm confused by this one since I expect it to be a software thing (e.g. the HTTP server), but the code here is the same - once things have booted up, I expect them to work the same...

nmap shows ports 50000 and 50001 open, so something's listening but apparently dropping the connection rather than respond. Nothing shows on the console when this happens.

frezbo commented 2 years ago

you should use pkgs from the main branch. Don't rely on the file sizes as they might differ from build environment. For testing I just built it locally which is not reproducible, as opposed to building from pkgs.

PR's always welcome.

I'm getting the "connection closed before server preface received"

this means wrong endpoint or talosconfig. Please note to not re-use machineconfig across new installs. Please generate a new one.

myndzi commented 2 years ago

Oh, interesting. That's likely it; I was reflashing the same config. Not sure why it works "sometimes" though.

I know the filesize can vary, but I didn't have any other ideas at the time.

Sorry if I wasn't clear about pkgs - the working one is indeed branched from main. My original intent was to set up a stable branch that I could merge or rebase over updates, so that even if I couldn't wrangle a PR and get it merged, I could have a reproducible setup with which to build updated versions. It was for that reason I started with branches from the latest release. I'm now curious, though, if it was the u-boot version update rather than the build argument that solved the problem; I can check that with a fresh cm4 build, so I will.

Fingers crossed that a pristine machineconfig sorts the last problem!

frezbo commented 2 years ago

I was reflashing the same config. Not sure why it works "sometimes" though.

there's nothing wrong in re-using an existing one, just that old node data in discovery service gets garbage collected only after 30 mins, so there might be little hiccup re-using configs when testing.

myndzi commented 2 years ago

I didn't think it would matter since I'm just using one node to test and completely wiping it. Anyway, generating new config seems to have worked :+1:

Interestingly, the stock rpi_4 image works when built from main (with the u-boot update) -- so the trouble may just have been that. I'm still going to use the "proper" build and send a PR when I'm done, though. I'm just renaming it and testing the final thing now.

I made a little script to generate my configs with static ips, extra hosts, setting up the nvme disk, etc. I could swear I saw somewhere in the docs something about copying the generated worker.yaml and modifying for a case like this, but I can't find it now. I was going to ask if that's actually the way to go -- your response suggests that should be okay after all. I guess the same tactic works for control plane nodes? Although those have the certificate private key in them, too; I'm not sure if they should each have their own or what.

Anyway, that's getting a bit far afield for this issue.

Image, boot, setup, bootstrap, reboot, and consistent talosctl all are working now :tada:

All that remains is to go through the whole process for the renamed build target and send PRs :)

Your timely responses have been super appreciated and helpful! I learned a good chunk of stuff, and am looking forward to playing around with talos (and kubernetes) some more. And cleaning up the disaster area of boxes that is currently my office.

frezbo commented 2 years ago

I could swear I saw somewhere in the docs something about copying the generated worker.yaml and modifying for a case like this, but I can't find it now.

yes, that's correct. Except when you're re-installing talos on the same node which might have same ip/hostname assigned continuously. It would take more time to stabilize due to old discovery data. Not a big deal. When testing and re-installing continuously it's better to use a fresh config.

myndzi commented 2 years ago

I sent https://github.com/siderolabs/pkgs/pull/593 and https://github.com/siderolabs/talos/pull/6318 - I think we can close this, since they don't directly resolve it, but there's no proper resolution to be had. Or keep it open until they merge, whichever :) Let me know if there's anything else I can do, I left comments in the PRs where things seemed not to work out but probably will in the CI system proper, or at merge time.

frezbo commented 2 years ago

I sent siderolabs/pkgs#593 and #6318 - I think we can close this, since they don't directly resolve it, but there's no proper resolution to be had. Or keep it open until they merge, whichever :) Let me know if there's anything else I can do, I left comments in the PRs where things seemed not to work out but probably will in the CI system proper, or at merge time.

awesome, thanks, I was thinking along the lines of removing the rpi4 specific one and just using the generic one, will discuss internally. but a generic one makes more sense now. We'd only official support tested boards which we have ta hand, but community can test with other variants.

myndzi commented 2 years ago

Yeah, it's a bit of a crapshoot having to support "everything", though I guess that's how it be with distros anyway. To that end, I never did get the CM4IO to boot with the NVME drive (via the PCIE adapter), but I found my way to a solution before I had to order a better power supply, so the jury's out on that one. For Talos, I think the Super6C (and eventually the Turing Pi 2), etc. make some sense to support.

For what it's worth, it took me a while but I was able to gather these CM4s over the course of a month or two. If you have the interest/budget and are just lacking the availability, I could keep lurking for others and acquire them for you?