Closed by micahhausler 3 years ago
Smells like a kexec kernel/machine issue. Kexec is great when it works, and looks exactly like this when it doesn't ;). It might be related to #35, but would be good to hear from @thebsdbox to see if this workflow works on his machine (and then it'd be your machine at fault :D )
I opened #58 to try to debug this, and here's what I got:
KEXEC - Kernel Exec
------------------------
INFO[0000] Mounted [/dev/nvme0n1p1] -> [/mountAction]
INFO[0000] Loaded boot config: &grub.Config{Name:"'Debian GNU/Linux' ", Kernel:"/boot/vmlinuz-4.19.0-16-cloud-amd64", Initramfs:"/boot/initrd.img-4.19.0-16-cloud-amd64", KernelArgs:"root=UUID=af5f8c82-a495-40e0-b3e6-f361abeec5a4 ro nosplash text biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 systemd.show_status=true", Multiboot:"", MultibootArgs:"", Modules:[]string(nil)}
INFO[0000] Running Kexec: kernel: /mountAction/boot/vmlinuz-4.19.0-16-cloud-amd64, initrd: /mountAction/boot/initrd.img-4.19.0-16-cloud-amd64, cmdLine: root=UUID=af5f8c82-a495-40e0-b3e6-f361abeec5a4 ro nosplash text biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 systemd.show_status=true
INFO[0000] Rebooting system
So it's this line where it's hanging:
// Call the unix reboot command with the kexec functionality
unix.Reboot(unix.LINUX_REBOOT_CMD_KEXEC)
I'm able to reproduce the behavior by running
kexec \
-l /mountAction/boot/vmlinuz-4.19.0-16-cloud-amd64 \
--initrd /mountAction/boot/initrd.img-4.19.0-16-cloud-amd64 \
--command-line="root=UUID=af5f8c82-a495-40e0-b3e6-f361abeec5a4 ro nosplash text biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 systemd.show_status=true"
# Hangs on this command
kexec -e
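For reference, here is a minimal Go sketch of how the manual `kexec -l` invocation above maps onto the values the action logs (mount point, kernel, initrd, command line). The `bootConfig` type and the `kexecLoadArgs` helper are hypothetical names made up for illustration, not part of the action's code; only the flags shown in the command above are used.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// bootConfig is a hypothetical stand-in for the fields the action logs
// from the parsed grub config (Kernel, Initramfs, KernelArgs).
type bootConfig struct {
	Kernel     string
	Initramfs  string
	KernelArgs string
}

// kexecLoadArgs builds the argument list for the "kexec -l" step,
// resolving kernel and initrd paths relative to the mounted partition.
func kexecLoadArgs(mountPoint string, cfg bootConfig) []string {
	return []string{
		"-l", filepath.Join(mountPoint, cfg.Kernel),
		"--initrd", filepath.Join(mountPoint, cfg.Initramfs),
		"--command-line=" + cfg.KernelArgs,
	}
}

func main() {
	cfg := bootConfig{
		Kernel:     "/boot/vmlinuz-4.19.0-16-cloud-amd64",
		Initramfs:  "/boot/initrd.img-4.19.0-16-cloud-amd64",
		KernelArgs: "root=UUID=af5f8c82-a495-40e0-b3e6-f361abeec5a4 ro console=ttyS0,115200",
	}
	// Print each argument quoted so the multi-word command line stays visible
	// as a single argument.
	fmt.Printf("kexec %q\n", kexecLoadArgs("/mountAction", cfg))
}
```

Passing the arguments as a slice (for example to `exec.Command`) avoids shell quoting problems with the multi-word kernel command line.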
Running strace confirms that it's the reboot(2) call that is failing:
linuxkit-54b203f08eda:/# strace kexec -e --debug --no-ifdown
execve("/usr/sbin/kexec", ["kexec", "-e", "--debug", "--no-ifdown"], 0x7ffeef863538 /* 14 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7f6186a88d48) = 0
set_tid_address(0x7f6186a8931c) = 1382
open("/etc/ld-musl-x86_64.path", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/liblzma.so.5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/liblzma.so.5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/liblzma.so.5", O_RDONLY|O_CLOEXEC) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fstat(3, {st_mode=S_IFREG|0755, st_size=136848, ...}) = 0
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3504\0\0\0\0\0\0"..., 960) = 960
mmap(NULL, 143360, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f61869d1000
mmap(0x7f61869d4000, 77824, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0x3000) = 0x7f61869d4000
mmap(0x7f61869e7000, 45056, PROT_READ, MAP_PRIVATE|MAP_FIXED, 3, 0x16000) = 0x7f61869e7000
mmap(0x7f61869f2000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x20000) = 0x7f61869f2000
close(3) = 0
open("/lib/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fstat(3, {st_mode=S_IFREG|0755, st_size=100144, ...}) = 0
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0003\0\0\0\0\0\0"..., 960) = 960
mmap(NULL, 106496, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f61869b7000
mmap(0x7f61869ba000, 57344, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0x3000) = 0x7f61869ba000
mmap(0x7f61869c8000, 28672, PROT_READ, MAP_PRIVATE|MAP_FIXED, 3, 0x11000) = 0x7f61869c8000
mmap(0x7f61869cf000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x17000) = 0x7f61869cf000
close(3) = 0
mprotect(0x7f61869f2000, 4096, PROT_READ) = 0
mprotect(0x7f61869cf000, 4096, PROT_READ) = 0
mprotect(0x7f6186a85000, 4096, PROT_READ) = 0
mprotect(0x556d6186c000, 16384, PROT_READ) = 0
access("/proc/xen", F_OK) = -1 ENOENT (No such file or directory)
open("/sys/kernel/kexec_loaded", O_RDONLY) = 3
read(3, "1\n", 1024) = 2
close(3) = 0
sync() = 0
reboot(LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, LINUX_REBOOT_CMD_KEXEC
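The trace shows kexec reading `/sys/kernel/kexec_loaded`, getting `1`, calling `sync()`, and then entering `reboot(2)`, which never returns. So the kernel was staged and the hang is in the kexec jump itself. A minimal Go sketch of the same pre-flight check follows; `parseKexecLoaded` and `kexecLoaded` are hypothetical helper names, and the sketch deliberately stops short of calling reboot.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// kexecLoadedPath is the sysfs file that kexec reads in the trace above.
const kexecLoadedPath = "/sys/kernel/kexec_loaded"

// parseKexecLoaded reports whether the file contents indicate that a
// kernel has been staged ("1", ignoring surrounding whitespace).
func parseKexecLoaded(contents string) bool {
	return strings.TrimSpace(contents) == "1"
}

// kexecLoaded reads the given file and reports whether a kernel is staged.
func kexecLoaded(path string) (bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return false, fmt.Errorf("reading %s: %w", path, err)
	}
	return parseKexecLoaded(string(data)), nil
}

func main() {
	loaded, err := kexecLoaded(kexecLoadedPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot determine kexec state:", err)
		return
	}
	if !loaded {
		fmt.Println("no kernel staged")
		return
	}
	// This sketch only reports the state; the action would call
	// unix.Reboot(unix.LINUX_REBOOT_CMD_KEXEC) at this point.
	fmt.Println("kernel staged; the kexec reboot would be the next step")
}
```

A check like this only rules out a missing `kexec -l` step. It cannot detect the hang reported here, because the kernel reports the image as loaded before the jump.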
Closing since this is not a hub action issue, but most likely a hardware problem.
I've tried the basic deployment examples for Ubuntu (20.04), Debian, and cluster-api-provider-tink, and in every case kexec hangs indefinitely for me.
My device no longer responds to the power button or to keyboard input.
After manually unplugging the power on my device and then pressing the power button, the provisioned OS boots.
Expected Behaviour
I'd expect kexec to boot the new kernel after a few seconds and come up with the provisioned operating system.
Current Behaviour
The machine hangs after the kexec action starts.
Possible Solution
Help! I'd love to be told I'm holding it wrong or doing something dumb.
Steps to Reproduce (for bugs)
tink hardware create < nuc03-hardware.json
/dev/nvme0n1 and partition /dev/nvme0n1p1
Prior to creating a workflow, I'm able to use the terminal and run the following commands:
I see the following kernel messages just prior to the kexec call
I see the following log messages
Context
I can't get anything to provision properly.
Your Environment
Hardware:
Network setup:
I've built a custom hook image using this gist
I followed the above guides in the Tinkerbell docs and my worker consistently hangs on kexec.
Could be related to/duplicate of #35?