kairos-io / kairos

:penguin: The immutable Linux meta-distribution for edge Kubernetes.
https://kairos.io
Apache License 2.0

systemctl can't start k3s #21

Closed irishgordo closed 2 years ago

irishgordo commented 2 years ago

Hey there! I really have been enjoying getting started with c3os! I think it will serve as a great replacement for my 1U SuperMicro Server that was running K3OS.

I was leveraging the Manual Installation Portion - once I booted from the USB in GRUB2.

I had just shelled into the IP, SCP'd over a cloud_init file that looked like this:

c3os:
  network_token: "...."
  role: "master"
vpn:
  # EdgeVPN environment options
  DHCP: "false"
  ADDRESS: "10.1.0.2/24"

stages:
   initramfs:
     - name: "Set user and password"
       users:
        c3os:
          passwd: "c3os"
   network:
     - if: '[ ! -f "/run/cos/recovery_mode" ]'
       name: "Setup k3s"
       environment_file: "/etc/sysconfig/k3s"
       environment:
         K3S_TOKEN: "..."
       systemctl:
         start: 
         - k3s

Then just ran sudo elemental install /dev/sda --cloud-init ./cloud_init.yaml. The install went great! And then once it was done, I rebooted, popp'd the USB drive out, shell'd into the box and snagged the kubeconfig from: sudo cat /etc/rancher/k3s/k3s.yaml. I was able to then on my workstation, hop on the kubectl with that --kubeconfig file. I installed cert-manager and rancher. It all was working like a charm. Got into the dashboard and everything. Reset the self-generated password. Powered it down for the night.

When I booted it up this morning, I couldn't get kubectl to interact with the node. I hopp'd on and took a peek at the journalctl logs:

Apr 27 15:54:21 c3os kernel: Bridge firewalling registered
Apr 27 15:54:23 c3os k3s[1852]: time="2022-04-27T15:54:23.449018811Z" level=info msg="Starting k3s v1.21.10+k3s1 (471f5eb3)"
Apr 27 15:54:23 c3os k3s[1852]: time="2022-04-27T15:54:23.478135963Z" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Apr 27 15:54:23 c3os k3s[1852]: time="2022-04-27T15:54:23.478294563Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Apr 27 15:54:23 c3os k3s[1852]: time="2022-04-27T15:54:23.498595746Z" level=info msg="Database tables and indexes are up to date"
Apr 27 15:54:23 c3os k3s[1852]: time="2022-04-27T15:54:23.799893800Z" level=info msg="Kine listening on unix://kine.sock"
Apr 27 15:54:23 c3os k3s[1852]: time="2022-04-27T15:54:23.824414503Z" level=fatal msg="starting kubernetes: preparing server: bootstrap data already found and encrypted with different token"
Apr 27 15:54:23 c3os systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Apr 27 15:54:23 c3os systemd[1]: k3s.service: Failed with result 'exit-code'.
Apr 27 15:54:23 c3os systemd[1]: Failed to start Lightweight Kubernetes.
Apr 27 15:54:23 c3os elemental[1731]: ERRO[2022-04-27T15:54:23Z] Job for k3s.service failed because the control process exited with error code.
Apr 27 15:54:23 c3os elemental[1731]: See "systemctl status k3s.service" and "journalctl -xe" for details.
Apr 27 15:54:23 c3os elemental[1731]: ERRO[2022-04-27T15:54:23Z] failed to run systemctl start k3s: exit status 1
Apr 27 15:54:23 c3os elemental[1731]: ERRO[2022-04-27T15:54:23Z] 1 error occurred:
Apr 27 15:54:23 c3os elemental[1731]:         * failed to run systemctl start k3s: exit status 1
Apr 27 15:54:23 c3os elemental[1731]:  
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Stage 'network'. Defined stages: 1. Errors: true
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Done executing stage 'network'
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Running stage: network.after
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/00_datasource.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/00_rootfs.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/02_agent.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/03_branding.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/04_installer.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/05_network.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/06_recovery.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/07_live.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/10_accounting.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/11_persistency.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/20_recovery_mode.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /system/oem/21_grub.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing /oem/99_custom.yaml
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Done executing stage 'network.after'
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Running stage: network.before
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 console=ttyS0 root=LABEL=COS_ACTIVE cos-img/filename=/cOS/active.img pani>
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Done executing stage 'network.before'
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Running stage: network
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 console=ttyS0 root=LABEL=COS_ACTIVE cos-img/filename=/cOS/active.img pani>
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Done executing stage 'network'
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Running stage: network.after
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Executing BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 console=ttyS0 root=LABEL=COS_ACTIVE cos-img/filename=/cOS/active.img pani>
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Done executing stage 'network.after'
Apr 27 15:54:23 c3os elemental[1731]: INFO[2022-04-27T15:54:23Z] Some errors found but were ignored. Enable --strict mode to fail on those or --debug to see them in the log
Apr 27 15:54:23 c3os elemental[1731]: WARN[2022-04-27T15:54:23Z] 2 errors occurred:
Apr 27 15:54:23 c3os elemental[1731]:         * No metadata/userdata found. Bye
Apr 27 15:54:23 c3os elemental[1731]:         * failed to run systemctl start k3s: exit status 1
Apr 27 15:54:23 c3os elemental[1731]:  
Apr 27 15:54:23 c3os systemd[1]: cos-setup-network.service: Succeeded.
Apr 27 15:54:23 c3os systemd[1]: Finished cOS setup after network.
Apr 27 15:54:23 c3os systemd[1]: Started c3os agent.
Apr 27 15:54:23 c3os elemental[1339]: INFO[2022-04-27T15:54:23Z] Command output:

###
### Further Down In the Journalctl Logs, the below repeats several times
###

Apr 27 16:44:54 c3os systemd[1]: k3s.service: Failed with result 'exit-code'.
Apr 27 16:44:54 c3os systemd[1]: Failed to start Lightweight Kubernetes.
Apr 27 16:44:59 c3os c3os[1866]: 2022-04-27T16:44:59.704Z        INFO        c3os        service/node.go:300        Applying role 'auto'
Apr 27 16:44:59 c3os c3os[1866]: 2022-04-27T16:44:59.704Z        INFO        c3os        service/role.go:115        Role loaded. Applying auto
Apr 27 16:44:59 c3os c3os[1866]: 2022-04-27T16:44:59.705Z        INFO        c3os        role/auto.go:23        Active nodes:[]
Apr 27 16:44:59 c3os c3os[1866]: 2022-04-27T16:44:59.706Z        INFO        c3os        role/auto.go:24        Advertizing nodes:[]
Apr 27 16:44:59 c3os c3os[1866]: 2022-04-27T16:44:59.706Z        INFO        c3os        role/auto.go:27        Not enough nodes
Apr 27 16:44:59 c3os c3os[1866]: 2022-04-27T16:44:59.706Z        INFO        c3os        service/node.go:300        Applying role 'master'
Apr 27 16:44:59 c3os c3os[1866]: 2022-04-27T16:44:59.706Z        INFO        c3os        service/role.go:115        Role loaded. Applying master
Apr 27 16:44:59 c3os c3os[1866]: 2022-04-27T16:44:59.706Z        WARN        c3os        go-log@v1.0.5/log.go:175        Failed applying rolemasternode doesn't have an ip yet
Apr 27 16:44:59 c3os systemd[1]: k3s.service: Scheduled restart job, restart counter is at 449.
Apr 27 16:44:59 c3os systemd[1]: Stopped Lightweight Kubernetes.
Apr 27 16:44:59 c3os systemd[1]: Starting Lightweight Kubernetes...
Apr 27 16:44:59 c3os sh[13919]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Apr 27 16:44:59 c3os sh[13924]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Apr 27 16:45:01 c3os k3s[13930]: time="2022-04-27T16:45:01.230256008Z" level=info msg="Starting k3s v1.21.10+k3s1 (471f5eb3)"
Apr 27 16:45:01 c3os k3s[13930]: time="2022-04-27T16:45:01.234393722Z" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Apr 27 16:45:01 c3os k3s[13930]: time="2022-04-27T16:45:01.234616000Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Apr 27 16:45:01 c3os k3s[13930]: time="2022-04-27T16:45:01.235844848Z" level=info msg="Database tables and indexes are up to date"
Apr 27 16:45:01 c3os k3s[13930]: time="2022-04-27T16:45:01.268516607Z" level=info msg="Kine listening on unix://kine.sock"
Apr 27 16:45:01 c3os k3s[13930]: time="2022-04-27T16:45:01.316561140Z" level=fatal msg="starting kubernetes: preparing server: bootstrap data already found and encrypted with different token"
Apr 27 16:45:01 c3os systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Apr 27 16:45:01 c3os systemd[1]: k3s.service: Failed with result 'exit-code'.
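
In a journal this long, the line that matters is the single level=fatal entry from k3s, and the restart counter shows how often it has repeated. A minimal sketch of a filter for that, assuming the log format shown above; the function name and the two regular expressions are illustrative and not part of c3os or k3s:

```python
import re

# Patterns modelled on the journal excerpt above (assumed format, not an official one).
FATAL_RE = re.compile(r'level=fatal msg="(?P<msg>[^"]*)"')
RESTART_RE = re.compile(r"restart counter is at (?P<count>\d+)")


def summarize_journal(lines):
    """Return (set of distinct fatal messages, highest restart counter seen)."""
    fatal_messages = set()
    max_restarts = 0
    for line in lines:
        fatal = FATAL_RE.search(line)
        if fatal:
            fatal_messages.add(fatal.group("msg"))
        restart = RESTART_RE.search(line)
        if restart:
            max_restarts = max(max_restarts, int(restart.group("count")))
    return fatal_messages, max_restarts


# Two lines copied from the journal above.
sample = [
    "Apr 27 16:44:59 c3os systemd[1]: k3s.service: Scheduled restart job, "
    "restart counter is at 449.",
    'Apr 27 16:45:01 c3os k3s[13930]: time="2022-04-27T16:45:01.316561140Z" '
    'level=fatal msg="starting kubernetes: preparing server: bootstrap data '
    'already found and encrypted with different token"',
]

messages, restarts = summarize_journal(sample)
for message in sorted(messages):
    print("fatal:", message)
print("restart counter:", restarts)
```

The same function could be fed the lines of a saved journal dump; only the two sample lines are used here.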

I'm thinking I just may have configured the cloud-init incorrectly? Should I have given a token that was alpha-numeric aside from just like ellipses "..."/"...."? Is it failing to start because of the: No metadata/userdata found. Bye? Would this be something I could correct on the box somewhere in the /oem/ path - or would it be easiest to blow it away and re-install it with a better cloud-init yaml config?

Thanks again for all the hard work on this project! It's awesome! I definitely appreciate any info/feedback!

mudler commented 2 years ago

Hi! thanks a lot for the feedback, it's very appreciated!

I think the issue here is that the cloud-init config has both a vpn/c3os section and a manual start of k3s defined among the stages. If you don't need the vpn/c3os feature set, you can drop the c3os/vpn block entirely and go manual.

Vice versa, if you don't intend to configure k3s manually, you can just use the c3os stanza, which configures k3s automatically, so there is no need to specify any k3s bits in the cloud-init.

TL;DR

Either you go with:

c3os:
  network_token: "...."
  role: "master"
vpn:
  # EdgeVPN environment options
  DHCP: "false"
  ADDRESS: "10.1.0.2/24"

stages:
   initramfs:
     - name: "Set user and password"
       users:
        c3os:
          passwd: "c3os"

or (note the network stage defined)

stages:
   initramfs:
     - name: "Set user and password"
       users:
        c3os:
          passwd: "c3os"
   network:
     - if: '[ ! -f "/run/cos/recovery_mode" ]'
       name: "Setup k3s"
       environment_file: "/etc/sysconfig/k3s"
       environment:
         K3S_TOKEN: "..."
       systemctl:
         start: 
         - k3s
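
The rule above (use either the c3os block or a manual k3s start, not both) can be checked mechanically. A minimal sketch, assuming the config has already been loaded into a plain dict (for instance with a YAML parser, which is left out here); the helper names are hypothetical and not part of c3os:

```python
def manual_k3s_steps(config):
    """List stage steps that start k3s through a `systemctl.start` entry."""
    found = []
    for stage, steps in (config.get("stages") or {}).items():
        for step in steps or []:
            started = (step.get("systemctl") or {}).get("start") or []
            if "k3s" in started:
                found.append(f"{stage}: {step.get('name', '<unnamed>')}")
    return found


def has_conflict(config):
    """True when a c3os block and a manual k3s start are both present."""
    return "c3os" in config and bool(manual_k3s_steps(config))


# Dict form of the config from the original report (tokens elided as in the issue).
reported = {
    "c3os": {"network_token": "....", "role": "master"},
    "vpn": {"DHCP": "false", "ADDRESS": "10.1.0.2/24"},
    "stages": {
        "initramfs": [
            {"name": "Set user and password",
             "users": {"c3os": {"passwd": "c3os"}}},
        ],
        "network": [
            {"if": '[ ! -f "/run/cos/recovery_mode" ]',
             "name": "Setup k3s",
             "environment_file": "/etc/sysconfig/k3s",
             "environment": {"K3S_TOKEN": "..."},
             "systemctl": {"start": ["k3s"]}},
        ],
    },
}

# The manual-only variant: same stages, no c3os/vpn blocks.
manual_only = {"stages": reported["stages"]}

print(has_conflict(reported))
print(has_conflict(manual_only))
print(manual_k3s_steps(reported))
```

This only looks at the two patterns discussed in this thread; it is not a general validator for the config format.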

mudler commented 2 years ago

Things have been smoothed out in the latest release; you can now just set (documented here for reference):

k3s:
  enabled: true
  # Additional (optional) env/args for k3s:
  # env:
  #  K3S_RESOLV_CONF: ""
  #  K3S_DATASTORE_ENDPOINT: "..."
  # args:
  # - --cluster-init
  # - ...
stages:
   initramfs:
     - name: "Set user and password"
       users:
        c3os:
          passwd: "c3os"

irishgordo commented 2 years ago

OoOO! Awesome! Thanks for the info about the new release @mudler .

I had actually pivoted back to running 22.04 ubuntu-server on that bare-metal device. Though it consumes more resources than I'd like even with a minimal install and running K3s.

I will be trying C3OS again. I would definitely enjoy having anything run on there that can run a K3s cluster with minimal overhead, like what I believe C3OS will accomplish.
I had tried both the opensuse and the alpine version. Is one release more robust for a lower-end system than the other?

Am I correct in still understanding that C3OS can run as a standalone K3s cluster?

The main goal would be just getting this tiny bare-metal server to run K3s, install cert-manager, and install Rancher on it, and have it be happy just running along not chomping through a ton of resources - single node, no 'highly available' kinda setup.

Also since I'm just more or less chatting about this, I'll go ahead and close this issue but I'd enjoy hearing your thoughts! :smile:

irishgordo commented 2 years ago

Also thanks for adding this example too! https://github.com/c3os-io/c3os/blob/master/examples/k3s-server.yaml It's great! :smile:

mudler commented 2 years ago

> OoOO! Awesome! Thanks for the info about the new release @mudler .
>
> I had actually pivoted back to running 22.04 ubuntu-server on that bare-metal device. Though it consumes more resources than I'd like even with a minimal install and running K3s.
>
> I will be trying C3OS again. I would definitely enjoy having anything run on there that can run a K3s cluster with minimal overhead, like what I believe C3OS will accomplish. I had tried both the opensuse and the alpine version. Is one release more robust for a lower-end system than the other?

The alpine release isn't officially supported yet, as it doesn't go through the same tests as the openSUSE one (at the moment), although generally speaking it has fewer services running and a smaller base image size. As soon as it gets the same coverage, it will be marked as ready to go.

> Am I correct in still understanding that C3OS can run as a standalone K3s cluster?

Yes, it can. The bootstrap process now handles the k3s setup automatically in standalone mode as well.

> The main goal would be just getting this tiny bare-metal server to run K3s, install cert-manager, and install Rancher on it, and have it be happy just running along not chomping through a ton of resources - single node, no 'highly available' kinda setup.
>
> Also since I'm just more or less chatting about this, I'll go ahead and close this issue but I'd enjoy hearing your thoughts! :smile:

That sounds like a great fit :) Feedback is really welcome, actually; it helps me understand where the doc pitfalls are at the moment. Thanks!

mudler commented 2 years ago

Ok, just as an update: both releases (openSUSE and alpine based) are functional now, so either can be used. There is also a guided interactive-install mode now, so things should be quite smooth!

irishgordo commented 2 years ago

@mudler awesome!! :partying_face: - that interactive install looks great! I'll try rolling this out to bare metal today!