Closed mmdevworkshop closed 2 years ago
cm4 is not a supported talos platform yet. so I'm not sure which docs you've followed
also without console logs, there's nothing much to provide help over, we don't have compute modules at hand
(I am the OP, was just on a different chrome profile when I posted and realized it too late)
The RPi 4 instructions worked out of the box, though I went back and crafted yamls until I had one that worked with the configuration options I wanted. I've given you everything present on the screen before and after, do you mean something else by "console logs"?
If you tell me what info you want/need (and maybe point me in a direction of how to get it), I should be able to do that :)
do you mean something else by "console logs"?
yes, not the one from talosctl dmesg
since that stops once apid stops. Something from a serial console would be perfect.
The RPi 4 instructions worked out of the box
the u-boot shipped for rpi4 talos image is not even the right one for cm4, so it's possible it would have issues
Yeah, I didn't expect dmesg to be that helpful, but I thought it might give some context about the shutdown process that might stand out to someone on the project.
I'll take a video of the screen at reboot and see if anything stands out, then see what it would take to get serial output. I likely need to order at least one thing from Amazon in that case unfortunately.
I'm also comfortable with Go, Docker, etc. and could follow the development/deploy steps, make code change, and do other misc. stuff to gather info -- I'm only missing the context about what should be happening and what to look for / explore. I expect part of this might overlap with the rpi bootloader, and I'm hoping that since Talos seems at least tested against some rpis that perhaps someone on the project has the context about what might matter in that realm too.
I'll report back after I've either found an interesting error or got a full serial log from both boot and reboot.
(Edit: video shows that there are some warnings at the u-boot stage, but the text on screen for a successful boot appears to be the same as the text on screen for an unsuccessful boot. Will pursue the serial console thing)
Yeah, I didn't expect dmesg to be that helpful, but I thought it might give some context about the shutdown process that might stand out to someone on the project.
it's all linux kernel at that point, nothing talos specific
but one thing that might be problematic is that the arm trusted firmware (ATF) in the rpi4 talos image is specific to rpi4 and it's something that manages PSCI, which managed the power state, so one assumption is that on CM4 it's not using the same chip or the rpi4 one is not compatible. Some reference to it is here in this PR: https://github.com/siderolabs/pkgs/pull/375
This can be easily confirmed by mounting the install medium (usb/sd card) to a machine, mounting the first partition and editing the config.txt
and removing the armstub=bl31.bin
and see if that fixes the power issue.
This can be easily confirmed by mounting the install medium (usb/sd card) to a machine, mounting the first partition and editing the
config.txt
and removing thearmstub=bl31.bin
and see if that fixes the power issue.
That is well within my capabilities to test :) I have a little test IO board coming tomorrow or the next day, plus a serial<->usb adapter, so I'll pull one of the CM4s and see if I can repro on a different board, see if there are any revealing logs, and see if altering the armstub line changes the behavior. I was running into bl31.bin in my searches but don't fully understand the details of what it does and when it's necessary, but I was generally suspicious of the difference being in config.txt anyway -- just wasn't sure how / where to find it (I flashed so many images, but Windows was just spamming me with unformatted disk dialogs, so I may have missed a mountable one -- or I may need to mount it from another o/s).
(I was trying to see which options Talos even used, and even deployed a pod with /boot mounted but it was empty, so thought maybe Talos was doing something a little different in the context of raspberry pis. I later found the config file in this repo, so I have some idea what options are being set, but your advice hopefully gives me a direct line. It's made a bit more annoying since I only have EMMC modules, no "lite" ones)
I also found some references to similar restart problems with the CM4 (much older github issues) relating to certain power management stuff, one implicating the power supply and one implicating the i/o board (which, it turns out, might be the board I actually bought...), and one implicating indirectly the gpu_mem line (related also to power stuff).
At any rate, I have a queue of things I can poke at without needing to understand the entire boot sequence in the context of Talos -- but I'm willing to dive deep with some guidance :)
Update (edited in because nothing has really changed):
Update 2:
OK, I have an update with some relevant information:
I'm not super sure of the power brick I hooked up to the CM4IO board; I thought it would take USB C, which I had, but instead had to scrounge around for something in my miscellaneous bin. The one I found is 12 volts at 2000 mA, and with a kill-a-watt hooked up while powering it on it never went past 3.1 watts or 0.05 amps of current. It could be possible that the boot failure on the CM4IO board is power-related, but seems unlikely since I can "kinda" reproduce on the Super6C (which mysteriously manages to boot from cold, at least). The Super6C with all 6 CM4s and all 6 NVMe drives didn't go past 20 watts, and the power brick is rated for 100 watts, making power problems seem less likely (but I'm still confused about why the boot logs contain an undervolt warning...). TLDR: I think the power is fine and probably not the problem here, though I've seen lots of power-related things implicated in similar problems in my searching.
Finally, here's the full cold boot log from Talos on the CM4IO with all the debug options on:
config.txt:
...
enable_uart=1
start_debug=1
dtoverlay=disable-bt,disable-wifi
uart_2ndstage=1
cold boot:
Read start4db.elf bytes 3739240 hnd 0x11061
Read fixup4db.dat bytes 8356 hnd 0x65c
Firmware: e5a963efa66a1974127860b42e913d2374139ff5 Mar 24 2022 13:19:58
0x00d03141 0x00000000 0x00000fff
MEM GPU: 32 ARM: 992 TOTAL: 1024
Starting start4db.elf @ 0xfec00200 partition 0
+
ASRT:00:00:03.818965:0: ../../../../../vcfw/drivers/chip/vciv/2708/clock_2711.c::clock_wait_busy!2151: timeout != 0
MESS:00:00:04.224240:0: arasan: arasan_emmc_open
MESS:00:00:04.744859:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:04.747811:0: brfs: File read: 603 bytes
MESS:00:00:04.870077:0: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:04.879593:0: HDMI1:EDID error reading EDID block 0 attempt 1
MESS:00:00:04.889100:0: HDMI1:EDID error reading EDID block 0 attempt 2
MESS:00:00:04.898614:0: HDMI1:EDID error reading EDID block 0 attempt 3
MESS:00:00:04.908124:0: HDMI1:EDID error reading EDID block 0 attempt 4
MESS:00:00:04.917638:0: HDMI1:EDID error reading EDID block 0 attempt 5
MESS:00:00:04.927145:0: HDMI1:EDID error reading EDID block 0 attempt 6
MESS:00:00:04.936660:0: HDMI1:EDID error reading EDID block 0 attempt 7
MESS:00:00:04.946170:0: HDMI1:EDID error reading EDID block 0 attempt 8
MESS:00:00:04.955684:0: HDMI1:EDID error reading EDID block 0 attempt 9
MESS:00:00:04.960183:0: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:04.966605:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:04.971005:0: gpioman: gpioman_get_pin_num: pin DISPLAY_SDA not defined
MESS:00:00:04.977763:0: gpioman: gpioman_get_pin_num: pin LEDS_PWR_OK not defined
MESS:00:00:05.006149:0: gpioman: gpioman_get_pin_num: pin FLASH_0_ENABLE not defined
MESS:00:00:05.010778:0: gpioman: gpioman_get_pin_num: pin FLASH_0_INDICATOR not defined
MESS:00:00:05.018519:0: gpioman: gpioman_get_pin_num: pin FLASH_0_ENABLE not defined
MESS:00:00:05.025968:0: gpioman: gpioman_get_pin_num: pin FLASH_0_INDICATOR not defined
ASRT:00:00:05.438291:0: ../../../../../vcfw/drivers/chip/vciv/2708/power.c::power_set_voltage!3436: uV >= platform_get_min_core_voltage() * 1000 && uV <= platform_get_max_core_voltage() * 1000 + (BCM2836 ? 12500:0)
ASRT:00:00:05.455298:0: ../../../../../vcfw/drivers/chip/vciv/2708/power.c::power_set_voltage!3436: uV >= platform_get_min_core_voltage() * 1000 && uV <= platform_get_max_core_voltage() * 1000 + (BCM2836 ? 12500:0)
ASRT:00:00:05.473335:0: ../../../../../vcfw/drivers/chip/vciv/2708/power.c::power_set_voltage!3436: uV >= platform_get_min_core_voltage() * 1000 && uV <= platform_get_max_core_voltage() * 1000 + (BCM2836 ? 12500:0)
MESS:00:00:05.536352:0: gpioman: gpioman_get_pin_num: pin LEDS_PWR_OK not defined
MESS:00:00:05.542107:0: *** Restart logging
MESS:00:00:05.544634:0: brfs: File read: 603 bytes
MESS:00:00:05.613205:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:05.623238:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 1
MESS:00:00:05.640207:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 2
MESS:00:00:05.650299:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 3
MESS:00:00:05.660396:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 4
MESS:00:00:05.670482:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 5
MESS:00:00:05.680577:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 6
MESS:00:00:05.690668:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 7
MESS:00:00:05.700764:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 8
MESS:00:00:05.710850:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 9
MESS:00:00:05.715928:0: hdmi: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:05.726587:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:05.736674:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 1
MESS:00:00:05.746770:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 2
MESS:00:00:05.756859:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 3
MESS:00:00:05.766955:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 4
MESS:00:00:05.777042:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 5
MESS:00:00:05.787138:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 6
MESS:00:00:05.797227:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 7
MESS:00:00:05.807323:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 8
MESS:00:00:05.817410:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 9
MESS:00:00:05.822487:0: hdmi: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:05.835941:0: HDMI0: hdmi_pixel_encoding: 300000000
MESS:00:00:05.838570:0: HDMI1: hdmi_pixel_encoding: 300000000
MESS:00:00:05.844929:0: gpioman: gpioman_get_pin_num: pin CAMERA_0_I2C_PORT not defined
MESS:00:00:05.856041:0: dtb_file 'bcm2711-rpi-cm4.dtb'
MESS:00:00:05.863429:0: brfs: File read: /mfs/sd/bcm2711-rpi-cm4.dtb
MESS:00:00:05.866678:0: Loading 'bcm2711-rpi-cm4.dtb' to 0x100 size 0xcbcc
MESS:00:00:05.886444:0: brfs: File read: 52172 bytes
MESS:00:00:05.895949:0: brfs: File read: /mfs/sd/overlays/overlay_map.dtb
MESS:00:00:05.972808:0: brfs: File read: 1783 bytes
MESS:00:00:05.976617:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:05.980466:0: brfs: File read: 603 bytes
MESS:00:00:05.986762:0: brfs: File read: /mfs/sd/overlays/disable-bt.dtbo
MESS:00:00:06.005797:0: Loaded overlay 'disable-bt'
MESS:00:00:06.007563:0: dtparam: disable-wifi=true
MESS:00:00:06.015131:0: Unknown dtparam 'disable-wifi' - ignored
MESS:00:00:06.119528:0: brfs: File read: 1073 bytes
MESS:00:00:06.123788:0: Failed to open command line file 'cmdline.txt'
MESS:00:00:06.383260:0: brfs: File read: /mfs/sd/bl31.bin
MESS:00:00:06.385596:0: Loading 'bl31.bin' to 0x0 size 0xb06b
MESS:00:00:06.391099:0: brfs: File read: 45163 bytes
MESS:00:00:06.515991:0: brfs: File read: /mfs/sd/u-boot.bin
MESS:00:00:06.518508:0: Loading 'u-boot.bin' to 0x80000 size 0x928f8
MESS:00:00:06.524600:0: Device tree loaded to 0x2eff2e00 (size 0xd1b7)
MESS:00:00:06.534331:0: uart: Set PL011 baud rate to 103448.300000 Hz
MESS:00:00:06.540677:0: uart: Baud rate change done...
MESS:00:00:06.542733:0: uart: Baud rate change done...
MESS:00:00:06.548264:0: gpioman: gpioman_get_pin_num: pin SDCARD_CONTROL_POWER not defined
NOTICE: BL31: v2.6(release):
NOTICE: BL31: Built : 18:08:32, Aug 12 2022
U-Boot 2022.04 (Aug 12 2022 - 18:08:35 +0000)
DRAM: 7.9 GiB
RPI Compute Module 4 (0xd03141)
Core: 201 devices, 13 uclasses, devicetree: board
MMC: mmcnr@7e300000: 1, mmc@7e340000: 0
Loading Environment from FAT... Unable to read "uboot.env" from mmc0:1...
In: serial
Out: vidconsole
Err: vidconsole
Net: eth0: ethernet@7d580000
PCIe BRCM: link up, 5.0 Gbps x1 (SSC)
starting USB...
No working controllers found
Hit any key to stop autoboot: 0
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
Card did not respond to voltage select! : -110
Scanning disk mmcnr@7e300000.blk...
Disk mmcnr@7e300000.blk not ready
Scanning disk mmc@7e340000.blk...
Found 7 disks
** Unable to read file ubootefi.var **
Failed to load EFI variables
BootOrder not defined
EFI boot manager: Cannot load any image
Found EFI removable media binary efi/boot/bootaa64.efi
147456 bytes read in 20 ms (7 MiB/s)
libfdt fdt_check_header(): FDT_ERR_BADMAGIC
Booting /efi\boot\bootaa64.efi
Welcome to GRUB!
GNU GRUB version 2.06
┌────────────────────────────────────────────────────────────────────────────┐
│*A - Talos v1.2.0 │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
└────────────────────────────────────────────────────────────────────────────┘
Use the ▲ and ▼ keys to select which entry is highlighted.
Press enter to boot the selected OS, `e' to edit the commands
before booting or `c' for a command-line.
The highlighted entry will be executed automatically in 0s.
Booting `A - Talos v1.2.0'
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...
(and here it hangs indefinitely)
I also have logs at this level of verbosity for:
... and all of these for the same module, in the same CM4IO board, with "raspberry pi os 64 bit lite" installed, for which everything succeeded. I did not have the NVMe adapter when I got those logs, so now I'm off to verify whether it behaves "normally" with NVME hooked up. (Edit: it does. boots, reboots, and shuts down fine)
If you want any of the above-mentioned logs, or have any tests to suggest, let me know :)
As far as I can tell, Talos seems to otherwise run fine on the CM4 just like it would on a Rpi4, but the pcie / nvme stuff may be where the divergence lies.
By the way, I didn't explicitly say it yet I think, but I've flashed the latest Rpi bootloader before doing anything on each of these CM4s
Edit, unrelated: Any chance I can just merge the contents of /boot from a Raspbian install over to /EFI/boot on a Talos install and just change config.txt? I'm wondering if maybe Talos's image maybe just has some out-of-date stuff on it?
I'm not super sure of the power brick I hooked up to the CM4IO board; I thought it would take USB C, which I had, but instead had to scrounge around for something in my miscellaneous bin. The one I found is 12 volts at 2000 mA, and with a kill-a-watt hooked up while powering it on it never went past 3.1 watts or 0.05 amps of current. It could be possible that the boot failure on the CM4IO board is power-related, but seems unlikely since I can "kinda" reproduce on the Super6C (which mysteriously manages to boot from cold, at least). The Super6C with all 6 CM4s and all 6 NVMe drives didn't go past 20 watts, and the power brick is rated for 100 watts, making power problems seem less likely (but I'm still confused about why the boot logs contain an undervolt warning...). TLDR: I think the power is fine and probably not the problem here, though I've seen lots of power-related things implicated in similar problems in my searching.
so regarding the power, it might be that even though the power brick can provide enough power, the VRM (voltage regulator module) for each nvme drive may not provide enough power. (This is an assumtption and would highly depend on the hardware design).
Edit, unrelated: Any chance I can just merge the contents of /boot from a Raspbian install over to /EFI/boot on a Talos install and just change config.txt? I'm wondering if maybe Talos's image maybe just has some out-of-date stuff on it?
so the problem here is that it may or may not work, since it's using combinations that are not explicitly designed for cm4. Eg u-boot is for rpi4 specific, armstub is rpi4 specific
also as mentioned earlier with SBC's and CM4 having so many variant support boards, there's nothing much help we can do on our side to test unless we have the exact hardware to test.
it might be worth testing with u-boot compiled for cm4 specific and replacing the u-boot.bin
with the cm4 specific one.
Attaching u-boot.bin
as a gzipped tar archive, built with rpi_arm64_defconfig
as mentioned here: https://github.com/u-boot/u-boot/blob/v2022.07/doc/board/broadcom/raspberrypi.rst#64-bit
I can give that a shot.
also as mentioned earlier with SBC's and CM4 having so many variant support boards, there's nothing much help we can do on our side to test unless we have the exact hardware to test.
I can do build and test work with some guidance on what to do/expect/investigate too, but I'll start here :) (The main thing is, I don't understand much about how the boot process works in the level of detail I would need; I don't know which parts are replaceable and which need to work together or how)
No luck with the alternate u-boot on the CM4IO board (with or without armstub disabled). I do have a different brand of M.2 stick on the way, too, just in case it changes anything
Update: Continued to test varying arrangements; I got a successful reboot with the u-boot linked above, on the Super6C, with NVME attached and a beefier power supply. The fact that it booted even once suggests that any remaining problem is not likely to be "bits on disk" in any combination, though there's always the possibility that the hardware environment is creating a problem that some software can always deal with and other software gets caught out by.
Next up:
Annoyingly, there's still the brief "undervolt" warning in the boot logs, presumably from everything starting at the same time. I guess that's on the board's power controllers.
(is armstub Pi 4 specific? It didn't seem to make a difference whether I loaded it or not earlier)
is armstub Pi 4 specific
yes, i believe so, I don't see one specific for cm4 https://github.com/ARM-software/arm-trusted-firmware/blob/master/docs/plat/rpi4.rst
The u-boot.bin you supplied seems to get the job done. I don't know why it doesn't work on the CM4IO board, but that's not the one I care about. Original power brick, all drives, and all CM4s reboots correctly with that file.
Thanks for your help, I believe this will be sufficient for my use; I'll work my way through the Talos build process so that I can have a repeatable sequence of steps for building updates too.
It may be suitable for a different issue/thread/channel of communication, but if you'd like for me to perform any specific test/verification steps beyond the things I happen to currently know about, for the purpose of publishing an "official" build that works for (at least one) CM4 setup, I'm happy to do whatever you need me to with the hardware I have; though, I'd completely understand if it makes more sense to wait until an actual contributor to the project has physical hardware they can develop on/with too.
you'd start by adding a u-boot target to https://github.com/siderolabs/pkgs/blob/main/u-boot/pkg.yaml and then something similar as https://github.com/siderolabs/talos/pull/6111
I believe I've successfully made the changes to both repositories, but I'm not sure how to integrate them in local testing...
In pkgs
I was able to run REGISTRY=localhost:5005 PLATFORM=linux/arm64 make docker-u-boot
, which succeeds; for expediency I'm just changing the Dockerfile line in the talos
repository that fetches u-boot to point to my local registry. However, even after building and publishing imager
from the talos repository, (I thought maybe the error I'm seeing came from an old version which doesn't have the new item in the "constants" package, and that maybe rebuilding it would update that) I can't seem to get it to recognize the new board type: failed to create installation manifest: unsupported board: "rpi_cm4"
(command: REGISTRY_AND_USERNAME=127.0.0.1:5005/siderolabs IMAGE_TAG=v1.2.3 PLATFORM=linux/arm64 make sbc-rpi_cm4
)
I'm sure I'm missing something a little silly, or maybe I just have to build everything (I can wait for that, but I don't want to wait and flail repeatedly) -- any pointers?
(I'm trying to create a metal xz image that I can flash to verify that my changes have the expected effect)
so on u-boot side run make u-boot REGISTRY=localhost:5005 PLATFORM=linux/arm64 PUSH=true
that would push it to the local registry, then on talos
side edit the dockerfile and change the pkgs
reference for u-boot to the locally published image. Then once you've added a new rpi_cm4
target, run a make installer REGISTRY=localhost:5005 PUSH=true
then after that make sbc-rpi_cm4
failed to create installation manifest: unsupported board: "rpi_cm4"
that probably means you missed to register it as shown here: https://github.com/siderolabs/talos/pull/6111/files#diff-35c1b024cdea1e459b51cadb40f31e777db09c8bff73a5454f304a472974e98cR69
I did add it there, but everything else you mentioned is what I did. I'll double-check the code, I wasn't entirely sure if "installer" was the image that is actually running the code that checks; I may have typoed something somewhere. My best guess is I'm running the wrong image, which is why I was trying to set the registry/tag arguments. Thanks!
Edit: found my way there. Had to set IMAGE_TAG=v1.2.3-1-g9a281239d
along with REGISTRY on the make sbc command: make sbc-rpi_4 REGISTRY=localhost:5005 IMAGE_TAG=v1.2.3-1-g9a281239d
. Thought I tried this yesterday, but must have got some other steps wrong. New error, but at least I know the unknowns now :)
Edit 2: Full steps (from a macbook m1) for posterity:
-dirty
pkgs % make u-boot REGISTRY=localhost:5005 PLATFORM=linux/arm64 PUSH=true
=> pushing manifest for localhost:5005/siderolabs/u-boot:v1.2.0-1-g826c116@sha256:5dd875595cf7452e6c075404b804b7b33f41430774f08fdedd7ef0dfc6d192c4
talos
Dockerfile (around line 84) to reference the local registry and the tag which was published above:FROM --platform=arm64 ghcr.io/siderolabs/u-boot:${PKGS} AS pkg-u-boot-arm64
becomes:FROM --platform=arm64 localhost:5005/siderolabs/u-boot:v1.2.0-1-g826c116 AS pkg-u-boot-arm64
talos % make installer PLATFORM=linux/arm64 REGISTRY=localhost:5005 PUSH=true
=> pushing manifest for localhost:5005/siderolabs/installer:v1.2.3-1-gc0e16bdca@sha256:913f3f8be5e43212659b33a93c68bd9ff7cb3022c74ef46101d858a363b9be86
talos % make imager PLATFORM=linux/arm64 REGISTRY=localhost:5005 PUSH=true
=> pushing manifest for localhost:5005/siderolabs/imager:v1.2.3-1-gc0e16bdca@sha256:52cca57d1b9fe788efffe1f4029503f51f5b9cf0fec7e509d02275296c7b32c8
- should be the same tag as installertalos % make sbc-rpi_cm4 REGISTRY=localhost:5005 IMAGE_TAG=v1.2.3-1-gc0e16bdca
./_out
, e.g. ./_out/metal-rpi_cm4-arm64.img.xz
Now to make sure that it actually boots on this image :)
The image I built flashes and boots correctly, but doesn't reboot :(
Either I compiled u-boot differently or I didn't use the correct u-boot bin when compiling the talos image. I gave it a different filename specifically to ensure everything lined up though, so my guess is the former. There's nothing much to do or change, though; I just made it with the arm64 default config: https://github.com/siderolabs/pkgs/compare/main...myndzi:pkgs:myndzi/cm4-super6c#diff-4fa7754436569ad36bda7353cbcde28216245cf1cd5873d46800e5e2023b5fd2R90
If this is how you built the zip you provided me, do you see anything different there?
that looks correct, built the u-boot as same. I think you should remove the armstub
though, since it's rpi4 specific
I thought I had -- will double-check (edit: it is not present, as expected). Will also just replace u-boot.bin with the zip you provided to see if it "works" again. Won't be able to poke at things for a while yet, thanks for confirming though.
Edit: it seems to possibly be copying the wrong file, not sure why yet. Building a docker image that copies the u-boot tree from the pkgs
-published image shows that the cm4 version is present, and the Install receiver for the RPiCM4 struct does reference the correct path, and the config.txt
file reflects the one I modified/added for the cm4. But the file on disk on the CM4 after flashing the sbc xz is larger than it should be, same size as the rpi_4 one. Also, for whatever reason, after a soft reboot (not a cold boot; it happened after the initial flash->boot and also after manual reboot commands), the API is unresponsive with an error code = Unavailable desc = connection closed before server preface received
- don't know what the deal is there yet. Cold booting makes it responsive to the API again.
OK, so I have a repeatable thing that works -- mostly. I had been working from the latest stable release branch of both repositories, but the filesize of the u-boot.bin I was generating was different from yours. I tried the full sequence listed above, but from the latest main branch of both repositories and it works. I'm not sure whether I care to work out whether the difference is the u-boot version (the latest main includes a version bump from pkgs@1.2.0), or the build configuration.
I'm getting the "connection closed before server preface received" problem still, sometimes (e.g. after a reboot, but not on a cold boot). Any ideas on what could be the cause of that, or how to debug further? (is there some verbose/debug argument to talosctl I can use to see details?)
When this happens, the console displays "boot sequence: done" and all the tasks have completed, and there don't seem to be any recurring errors like there can be when some services haven't started up yet. I'm confused by this one since I expect it to be a software thing (e.g. the HTTP server), but the code here is the same - once things have booted up, I expect them to work the same...
nmap shows ports 50000 and 50001 open, so something's listening but apparently dropping the connection rather than respond. Nothing shows on the console when this happens.
you should use pkgs
from the main branch. Don't rely on the file sizes as they might differ from build environment. For testing I just built it locally which is not reproducible, as opposed to building from pkgs
.
PR's always welcome.
I'm getting the "connection closed before server preface received"
this means wrong endpoint or talosconfig. Please note to not re-use machineconfig
across new installs. Please generate a new one.
Oh, interesting. That's likely it; I was reflashing the same config. Not sure why it works "sometimes" though.
I know the filesize can vary, but I didn't have any other ideas at the time.
Sorry if I wasn't clear about pkgs
- the working one is indeed branched from main. My original intent was to set up a stable branch that I could merge or rebase over updates, so that even if I couldn't wrangle a PR and get it merged, I could have a reproducible setup with which to build updated versions. It was for that reason I started with branches from the latest release. I'm now curious, though, if it was the u-boot version update rather than the build argument that solved the problem; I can check that with a fresh cm4 build, so I will.
Fingers crossed that a pristine machineconfig sorts the last problem!
I was reflashing the same config. Not sure why it works "sometimes" though.
there's nothing wrong in re-using an existing one, just that old node data in discovery service gets garbage collected only after 30 mins, so there might be little hiccup re-using configs when testing.
I didn't think it would matter since I'm just using one node to test and completely wiping it. Anyway, generating new config seems to have worked :+1:
Interestingly, the stock rpi_4 image works when built from main (with the u-boot update) -- so the trouble may just have been that. I'm still going to use the "proper" build and send a PR when I'm done, though. I'm just renaming it and testing the final thing now.
I made a little script to generate my configs with static ips, extra hosts, setting up the nvme disk, etc. I could swear I saw somewhere in the docs something about copying the generated worker.yaml and modifying for a case like this, but I can't find it now. I was going to ask if that's actually the way to go -- your response suggests that should be okay after all. I guess the same tactic works for control plane nodes? Although those have the certificate private key in them, too; I'm not sure if they should each have their own or what.
Anyway, that's getting a bit far afield for this issue.
Image, boot, setup, bootstrap, reboot, and consistent talosctl all are working now :tada:
All that remains is to go through the whole process for the renamed build target and send PRs :)
Your timely responses have been super appreciated and helpful! I learned a good chunk of stuff, and am looking forward to playing around with talos (and kubernetes) some more. And cleaning up the disaster area of boxes that is currently my office.
I could swear I saw somewhere in the docs something about copying the generated worker.yaml and modifying for a case like this, but I can't find it now.
yes, that's correct. Except when you're re-installing talos on the same node which might have same ip/hostname assigned continuously. It would take more time to stabilize due to old discovery data. Not a big deal. When testing and re-installing continuously it's better to use a fresh config.
I sent https://github.com/siderolabs/pkgs/pull/593 and https://github.com/siderolabs/talos/pull/6318 - I think we can close this, since they don't directly resolve it, but there's no proper resolution to be had. Or keep it open until they merge, whichever :) Let me know if there's anything else I can do, I left comments in the PRs where things seemed not to work out but probably will in the CI system proper, or at merge time.
I sent siderolabs/pkgs#593 and #6318 - I think we can close this, since they don't directly resolve it, but there's no proper resolution to be had. Or keep it open until they merge, whichever :) Let me know if there's anything else I can do, I left comments in the PRs where things seemed not to work out but probably will in the CI system proper, or at merge time.
awesome, thanks, I was thinking along the lines of removing the rpi4 specific one and just using the generic one, will discuss internally. but a generic one makes more sense now. We'd only official support tested boards which we have ta hand, but community can test with other variants.
Yeah, it's a bit of a crapshoot having to support "everything", though I guess that's how it be with distros anyway. To that end, I never did get the CM4IO to boot with the NVME drive (via the PCIE adapter), but I found my way to a solution before I had to order a better power supply, so the jury's out on that one. For Talos, I think the Super6C (and eventually the Turing Pi 2), etc. make some sense to support.
For what it's worth, it took me a while but I was able to gather these CM4s over the course of a month or two. If you have the interest/budget and are just lacking the availability, I could keep lurking for others and acquire them for you?
Bug Report
Raspberry Pi CM4 doesn't reboot cleanly -- hangs after GRUB.
Description
I've installed 6x CM4 units in a https://github.com/DeskPi-Team/super6c mini-itx cluster board for the purpose of experimenting with Kubernetes in a home lab environment. Talos seems like a good fit. I managed to flail my way through the documentation, but after flashing and reflashing things many times, I have noticed that the individual modules won't reboot cleanly. Troubleshooting is made trickier by the fact that only the first of the 6 modules has a console output. While that was the one I made the control plane,
talosctl dmesg
and experimentation suggests that control plane vs worker is irrelevant here.I have tried both
talosctl reboot
andtalosctl reboot -m powercycle
, but it didn't make a difference. I found https://github.com/pftf/RPi4/issues/168 and attempted to add the kernel parameterefi=no_disable_early_pci_dma
(got hung up for a while figuring out how to make it add the parameter -- lurkers: you need to go through a talosctl upgrade), but that didn't affect anything. I also tried settingreboot=k
(which seems to be referenced in various examples , so I thought maybe it was common for some reason), but that didn't help either. I might have enough patience to try going through variousreboot=
parameters tomorrow, but I'm not holding out much hope.I did install regular old raspbian on one of the modules, and
sudo reboot
works fine; it comes back up and is sshable shortly. I previously had done one of the modules which now is running Talos, so it does not seem related to the specific compute module or the socket/location. I inspected the kernel parameters used by the default Raspbian install, but nothing stands out; these are the parameters it has set:console=serial0,115200 console=tty1 root=PARTUUID=8dae39b5-02 rootfstype=ext4 fsck.repair=yes rootwait
.I'm kind of out of ideas from here. Unfortunately this is a deal breaker for playing with Talos, since this board cannot power cycle individual modules: it's not possible for me to do a rolling upgrade (every node I reboot is dead until I turn everything off), which kind of defeats the whole point.
I do have some time on my hands and am willing to help run down what's happening if someone can guide me, though. I don't know much about the subtle mechanics of rebooting (I didn't know it was so complicated until today's googling!)
Logs
The only module I can directly observe is the first module, but the others appear to behave in the same fashion as far as I can observe. The module cold boots fine, but upon issuing
talosctl reboot
, the following happens:Nothing else happens from here; I've left it sitting for at least 30 minutes.
Here's what prints from
dmesg -f
up until the connection is closed:Environment
talosctl version --nodes <problematic nodes>
]kubectl version --short
]