tinkerbell / actions

Suite of Tinkerbell Actions for use in Tinkerbell Workflows
Apache License 2.0

Kexec hangs after execution #57

Closed: micahhausler closed this issue 3 years ago

micahhausler commented 3 years ago

I've tried the basic deployment examples for Ubuntu (20.04), Debian, and cluster-api-provider-tink, and in every case kexec hangs indefinitely for me.

My device no longer responds to the power button or to keyboard input.

After manually unplugging the power on my device and then pressing the power button, the provisioned OS boots.

Expected Behaviour

I'd expect kexec to boot the new kernel within a few seconds and come up with the provisioned operating system.

Current Behaviour

The machine hangs after the kexec action starts

Possible Solution

Help! I'd love to be told I'm holding it wrong or doing something dumb.

Steps to Reproduce (for bugs)

  1. Create hardware with the MAC address of the worker:
    tink hardware create < nuc03-hardware.json
  2. Create a template for the Debian install using disk /dev/nvme0n1 and partition /dev/nvme0n1p1:
    version: "0.1"
    name: debian
    global_timeout: 1800
    tasks:
    - name: "os-installation"
      worker: "{{.device_1}}"
      volumes:
        - /dev:/dev
        - /dev/console:/dev/console
        - /lib/firmware:/lib/firmware:ro
      actions:
        - name: "stream-debian-image"
          image: quay.io/tinkerbell-actions/image2disk:v1.0.0
          timeout: 600
          environment:
            DEST_DISK: /dev/nvme0n1
            IMG_URL: "http://10.1.1.11:8080/debian-10-openstack-amd64.raw.gz"
            COMPRESSED: true
        - name: "kexec-debian"
          image: quay.io/tinkerbell-actions/kexec:v1.0.0
          timeout: 90
          pid: host
          environment:
            BLOCK_DEVICE: /dev/nvme0n1p1
            FS_TYPE: ext4
  3. Create workflow
    tink workflow create -t $TEMPLATE_ID -r "{\"device_1\": \"$NUC03_MAC\"}"

Prior to creating a workflow, I'm able to use the terminal and run the following commands:

nsenter -a -t `pidof dockerd`
docker ps -a 
docker logs -f $TINK_WORKER_CONTAINER

I see the following kernel messages just prior to the kexec call

EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
kexec_core: Starting new kernel

I see the following log messages

IMAGE2DISK - Cloud image streamer
------------------------
INFO[0000] Beginning write of image [debian-10-openstack-amd64.raw.gz] to disk [/dev/nvme0n1] 
Downloading... 1.7 GB complete     
INFO[0010] Successfully written [http://10.1.1.11:8080/debian-10-openstack-amd64.raw.gz] to [/dev/nvme0n1] 
KEXEC - Kernel Exec
------------------------
INFO[0000] Mounted [/dev/nvme0n1p1] -> [/mountAction]

$ tink workflow events 3efc93ac-d07b-11eb-ba8d-0242ac120004
+--------------------------------------+-----------------+-------------------------------+----------------+---------------------------------+---------------+
| WORKER ID                            | TASK NAME       | ACTION NAME                   | EXECUTION TIME | MESSAGE                         | ACTION STATUS |
+--------------------------------------+-----------------+-------------------------------+----------------+---------------------------------+---------------+
| ce2e62ed-826f-4485-a39f-a82bb74338e2 | os-installation | stream-debian-image           |              0 | Started execution               | STATE_RUNNING |
| ce2e62ed-826f-4485-a39f-a82bb74338e2 | os-installation | stream-debian-image           |             10 | finished execution successfully | STATE_SUCCESS |
| ce2e62ed-826f-4485-a39f-a82bb74338e2 | os-installation | add-tink-cloud-init-config    |              0 | Started execution               | STATE_RUNNING |
| ce2e62ed-826f-4485-a39f-a82bb74338e2 | os-installation | add-tink-cloud-init-config    |              0 | finished execution successfully | STATE_SUCCESS |
| ce2e62ed-826f-4485-a39f-a82bb74338e2 | os-installation | add-tink-cloud-init-ds-config |              0 | Started execution               | STATE_RUNNING |
| ce2e62ed-826f-4485-a39f-a82bb74338e2 | os-installation | add-tink-cloud-init-ds-config |              0 | finished execution successfully | STATE_SUCCESS |
| ce2e62ed-826f-4485-a39f-a82bb74338e2 | os-installation | kexec-debian                  |              0 | Started execution               | STATE_RUNNING |
+--------------------------------------+-----------------+-------------------------------+----------------+---------------------------------+---------------+
$ tink workflow state 3efc93ac-d07b-11eb-ba8d-0242ac120004
+----------------------+--------------------------------------+
| FIELD NAME           | VALUES                               |
+----------------------+--------------------------------------+
| Workflow ID          | 3efc93ac-d07b-11eb-ba8d-0242ac120004 |
| Workflow Progress    | 75%                                  |
| Current Task         | os-installation                      |
| Current Action       | kexec-debian                         |
| Current Worker       | ce2e62ed-826f-4485-a39f-a82bb74338e2 |
| Current Action State | STATE_RUNNING                        |
+----------------------+--------------------------------------+
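
For anyone scripting around a hang like this, the state table above can be parsed to see which action the workflow is stuck on. This is only a rough sketch: `workflow_field` is a hypothetical helper name, and it assumes the `tink workflow state` output keeps the two-column `| FIELD NAME | VALUES |` layout shown here.

```shell
# Extract one value from the two-column ASCII table printed by
# `tink workflow state`. Reads the table on stdin; the field name is $1.
# Exits non-zero when the field is not present.
workflow_field() {
  awk -F'|' -v want="$1" '
    NF >= 3 {
      key = $2; val = $3
      # Trim the padding around both columns.
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key == want) { print val; found = 1; exit }
    }
    END { exit !found }
  '
}
```

Usage would look like `tink workflow state "$WORKFLOW_ID" | workflow_field "Current Action State"`, where `$WORKFLOW_ID` is a placeholder for the ID above. Border rows contain no `|`, so they are skipped without special handling.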

Context

I can't get anything to provision properly.

Your Environment

Hardware:

Network setup:

I've built a custom hook image using this gist

I followed the above guides in the Tinkerbell docs and my worker consistently hangs on kexec.

Could be related to/duplicate of #35?

mmlb commented 3 years ago

Smells like a kexec kernel/machine issue. Kexec is great when it works, and looks exactly like this when it doesn't ;). It might be related to #35, but would be good to hear from @thebsdbox to see if this workflow works on his machine (and then it'd be your machine at fault :D )

micahhausler commented 3 years ago

I opened #58 to try and debug, and here's what I got

KEXEC - Kernel Exec
------------------------
INFO[0000] Mounted [/dev/nvme0n1p1] -> [/mountAction]  
INFO[0000] Loaded boot config: &grub.Config{Name:"'Debian GNU/Linux' ", Kernel:"/boot/vmlinuz-4.19.0-16-cloud-amd64", Initramfs:"/boot/initrd.img-4.19.0-16-cloud-amd64", KernelArgs:"root=UUID=af5f8c82-a495-40e0-b3e6-f361abeec5a4 ro nosplash text biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 systemd.show_status=true", Multiboot:"", MultibootArgs:"", Modules:[]string(nil)} 
INFO[0000] Running Kexec: kernel: /mountAction/boot/vmlinuz-4.19.0-16-cloud-amd64, initrd: /mountAction/boot/initrd.img-4.19.0-16-cloud-amd64, cmdLine: root=UUID=af5f8c82-a495-40e0-b3e6-f361abeec5a4 ro nosplash text biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 systemd.show_status=true 
INFO[0000] Rebooting system     
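
The log shows the grub paths being prefixed with the mount point before the kernel is loaded. A dry-run sketch of that mapping, useful for reproducing the load step by hand, might look like the following. `kexec_load_cmd` is a hypothetical helper, not part of the action; it only prints the command and assumes the `kexec-tools` flags `-l`, `--initrd`, and `--command-line`.

```shell
# Print the `kexec -l` invocation for a boot entry under a mount point.
# Arguments: mount point, kernel path, initrd path, kernel command line.
# Dry run only: the command is echoed, never executed.
kexec_load_cmd() {
  mnt="$1"; kernel="$2"; initrd="$3"; cmdline="$4"
  printf 'kexec -l %s --initrd %s --command-line="%s"\n' \
    "${mnt}${kernel}" "${mnt}${initrd}" "$cmdline"
}
```

For example, `kexec_load_cmd /mountAction /boot/vmlinuz-4.19.0-16-cloud-amd64 /boot/initrd.img-4.19.0-16-cloud-amd64 "$ARGS"` prints the load command for the entry in the log, with `$ARGS` standing in for the `KernelArgs` string.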

So it's this line where it's hanging:

// Call the unix reboot command with the kexec functionality
unix.Reboot(unix.LINUX_REBOOT_CMD_KEXEC)
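
Before blaming the reboot call itself, it can help to rule out the simpler case where no image was staged. The kernel exposes that as `/sys/kernel/kexec_loaded`, which reads `1` when an image is loaded. A minimal check, with `kexec_is_loaded` as a hypothetical helper name and the path argument existing only so the function can be tried against a test file:

```shell
# Return 0 if a kexec image is staged, non-zero otherwise.
# $1 optionally overrides the sysfs path (an assumption for testing).
kexec_is_loaded() {
  flag_file="${1:-/sys/kernel/kexec_loaded}"
  # Command substitution strips the trailing newline from the sysfs value.
  [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}
```

If `kexec_is_loaded` succeeds and the machine still hangs, the load step worked and the problem is in the jump to the new kernel.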

I'm able to reproduce the behavior by running

kexec \
  -l /mountAction/boot/vmlinuz-4.19.0-16-cloud-amd64 \
  --initrd /mountAction/boot/initrd.img-4.19.0-16-cloud-amd64 \
  --command-line="root=UUID=af5f8c82-a495-40e0-b3e6-f361abeec5a4 ro nosplash text biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 systemd.show_status=true"
# Hangs on this command
kexec -e

micahhausler commented 3 years ago

Running strace confirms that it's the reboot(2) call that never returns:

linuxkit-54b203f08eda:/# strace kexec -e --debug --no-ifdown
execve("/usr/sbin/kexec", ["kexec", "-e", "--debug", "--no-ifdown"], 0x7ffeef863538 /* 14 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7f6186a88d48) = 0
set_tid_address(0x7f6186a8931c)         = 1382
open("/etc/ld-musl-x86_64.path", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/liblzma.so.5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/liblzma.so.5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/liblzma.so.5", O_RDONLY|O_CLOEXEC) = 3
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
fstat(3, {st_mode=S_IFREG|0755, st_size=136848, ...}) = 0
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3504\0\0\0\0\0\0"..., 960) = 960
mmap(NULL, 143360, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f61869d1000
mmap(0x7f61869d4000, 77824, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0x3000) = 0x7f61869d4000
mmap(0x7f61869e7000, 45056, PROT_READ, MAP_PRIVATE|MAP_FIXED, 3, 0x16000) = 0x7f61869e7000
mmap(0x7f61869f2000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x20000) = 0x7f61869f2000
close(3)                                = 0
open("/lib/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
fstat(3, {st_mode=S_IFREG|0755, st_size=100144, ...}) = 0
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0003\0\0\0\0\0\0"..., 960) = 960
mmap(NULL, 106496, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f61869b7000
mmap(0x7f61869ba000, 57344, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0x3000) = 0x7f61869ba000
mmap(0x7f61869c8000, 28672, PROT_READ, MAP_PRIVATE|MAP_FIXED, 3, 0x11000) = 0x7f61869c8000
mmap(0x7f61869cf000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x17000) = 0x7f61869cf000
close(3)                                = 0
mprotect(0x7f61869f2000, 4096, PROT_READ) = 0
mprotect(0x7f61869cf000, 4096, PROT_READ) = 0
mprotect(0x7f6186a85000, 4096, PROT_READ) = 0
mprotect(0x556d6186c000, 16384, PROT_READ) = 0
access("/proc/xen", F_OK)               = -1 ENOENT (No such file or directory)
open("/sys/kernel/kexec_loaded", O_RDONLY) = 3
read(3, "1\n", 1024)                    = 2
close(3)                                = 0
sync()                                  = 0
reboot(LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, LINUX_REBOOT_CMD_KEXEC

micahhausler commented 3 years ago

Closing, since this is not a hub action issue but most likely a hardware problem.