k0sproject / k0sctl

A bootstrapping and management tool for k0s clusters.

tee: /usr/local/bin/k0s: Text file busy #357

Open twz123 opened 2 years ago

twz123 commented 2 years ago

Upgrading a cluster from one node to three nodes failed with the following log line:

level=fatal msg="upload failed: Process exited with status 1 (tee: /usr/local/bin/k0s: Text file busy\n)"

Target OS: Alpine 3.15

k0sctl version: 0.13.0-rc.1-1-gaf2f60b (af2f60b896c1b4ba4f1e6016fe445d9cfa7fe247)

Log: k0sctl.log

A second run of k0sctl also fails because it tries to join new controllers by requesting a token from the wrong node (the newly created one which hasn't been joined):

time="25 Mar 22 10:29 CET" level=debug msg="[ssh] 10.83.134.16:22: executing sudo -s /usr/local/bin/k0s token create --role controller --expiry 10m0s"

Logs from the second run: k0sctl_2.log

Config used:

"apiVersion": "k0sctl.k0sproject.io/v1beta1"
"kind": "Cluster"
"metadata":
  "name": "k0s-cluster"
"spec":
  "hosts":
  - "files":
    - "dstDir": "/var/lib/k0s/images/"
      "name": "bundle-file"
      "perm": "0755"
      "src": "airgap-images.tar"
    "role": "controller+worker"
    "ssh":
      "address": "10.83.134.28"
      "keyPath": "id_rsa"
      "port": 22
      "user": "k0s"
    "uploadBinary": true
  - "files":
    - "dstDir": "/var/lib/k0s/images/"
      "name": "bundle-file"
      "perm": "0755"
      "src": "airgap-images.tar"
    "role": "controller+worker"
    "ssh":
      "address": "10.83.134.16"
      "keyPath": "id_rsa"
      "port": 22
      "user": "k0s"
    "uploadBinary": true
  - "files":
    - "dstDir": "/var/lib/k0s/images/"
      "name": "bundle-file"
      "perm": "0755"
      "src": "airgap-images.tar"
    "role": "controller+worker"
    "ssh":
      "address": "10.83.134.66"
      "keyPath": "id_rsa"
      "port": 22
      "user": "k0s"
    "uploadBinary": true
  "k0s":
    "config":
      "spec":
        "telemetry":
          "enabled": false
    "version": "v1.23.3+k0s.1"

twz123 commented 2 years ago

I was able to reproduce it (the "tee" error) without the upscale, i.e. running k0sctl again on a single node that has already been provisioned via a prior run of k0sctl.

I noticed that k0sctl wants to upgrade even if the target host is already running the correct version.

WARN [ssh] 10.83.134.135:22: k0s will be upgraded

kke commented 2 years ago

> I noticed that k0sctl wants to upgrade even if the target host is already running the correct version.

Yes, this always happens when k0sBinaryPath or files: is used, because k0sctl had no way of knowing whether the file had changed. Now that k0sctl can detect local vs. remote file changes, it should probably take that into account when deciding whether to choose the upgrade workflow.
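To illustrate the kind of check described here, a minimal sketch follows. It assumes the remote side can report a SHA-256 digest of the installed file; `sha256_of` and `needs_upload` are hypothetical names for illustration, not k0sctl functions, and k0sctl's actual change detection may work differently.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(local: Path, remote_digest: Optional[str]) -> bool:
    """Decide whether the local file has to be uploaded.

    remote_digest is the digest reported for the remote copy, or None if
    the remote file does not exist (hypothetical convention).
    """
    if remote_digest is None:
        return True
    return sha256_of(local) != remote_digest
```

With a check like this, an unchanged binary would not by itself be a reason to pick the upgrade workflow.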

kke commented 2 years ago

> A second run of k0sctl also fails because it tries to join new controllers by requesting a token from the wrong node (the newly created one which hasn't been joined)

I wonder how this happens. The K0sLeader() should always pick a host that has k0s running.
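For reference, the expected selection behaviour can be sketched as follows. This is not the real K0sLeader() implementation; `Host`, its fields, and `pick_leader` are hypothetical names, and the only rule encoded is the one stated above: the leader must be a controller that already has k0s running.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Host:
    address: str
    role: str          # e.g. "controller", "controller+worker", "worker"
    k0s_running: bool  # hypothetical flag gathered before leader selection


def pick_leader(hosts: Iterable[Host]) -> Optional[Host]:
    """Return the first controller that already runs k0s, or None."""
    for host in hosts:
        if host.role.startswith("controller") and host.k0s_running:
            return host
    return None
```

If a freshly added controller were ever marked as running, for example because its binary and service were left behind by a failed earlier run, a rule like this would select it, and a token request against that host would fail.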

kke commented 2 years ago

> tee: /usr/local/bin/k0s: Text file busy

The only possible explanation for this is that k0s is still running when trying to replace the binary.
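On Linux, opening an executable for writing while it is being executed fails with ETXTBSY ("Text file busy"), which is what tee reports in the log. Renaming a new file over the old path is allowed, because the running process keeps the old inode. The sketch below shows that general write-to-temp-then-rename technique. It is not what k0sctl does, and `replace_file_atomically` is a hypothetical helper.

```python
import os
import tempfile


def replace_file_atomically(dest: str, data: bytes, mode: int = 0o755) -> None:
    """Write data to a temp file next to dest, then rename it over dest."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".upload-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, mode)
        # Swaps the directory entry; the old inode is never opened for writing.
        os.replace(tmp_path, dest)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

This only avoids the upload error itself. A process that is still running keeps executing the old binary until it is restarted.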

chattytak commented 2 years ago

I had the exact same problem. The first update failed with "tee: /usr/local/bin/k0s: Text file busy", and the k0s binary was removed from the node where the error occurred. I then ran the update again with k0sctl, but it failed at token generation and join. That run did put the k0s binary back on the node, though, so after starting the service again with systemctl on the node, I ran the update with k0sctl once more and it completed successfully.

twz123 commented 2 years ago

This is definitely a timing issue. There is a check for whether k0s is still running, but it may race with a process that is about to terminate but has not quite exited. When rerunning k0sctl apply (after a few seconds), the binary can be uploaded, but the run fails later on when trying to invoke k0s install (#362).
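One generic way to narrow such a race is to poll until the process is really gone before touching the binary. A minimal sketch, assuming a caller-supplied predicate that reports whether k0s has stopped; `wait_until` is a hypothetical helper, not part of k0sctl:

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Poll condition until it returns True or timeout seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The upload step would then run only after `wait_until(k0s_has_stopped)` returns True, where `k0s_has_stopped` stands for whatever check confirms that the process has exited.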

I see multiple ways of fixing this:

kke commented 2 years ago

Hmmm, this is a forced upgrade because of the presence of files. The "upload binaries" phase should be skipped because k0s is going to be upgraded. There's some error in the host selection logic in that phase.

twz123 commented 2 years ago

Reopening as this is not yet resolved.