poseidon / typhoon

Minimal and free Kubernetes distribution with Terraform
https://typhoon.psdn.io/
MIT License

Bare-metal fails to bootstrap when `terraform apply` gets cancelled #637

Closed · waddles closed this issue 4 years ago

waddles commented 4 years ago

Bug

Environment

Problem

Typhoon fails to bootstrap the Kubernetes cluster

module.cluster.null_resource.bootstrap: Provisioning with 'remote-exec'...
module.cluster.null_resource.bootstrap (remote-exec): Connecting to remote host via SSH...
module.cluster.null_resource.bootstrap (remote-exec):   Host: node0.cluster.example.com
module.cluster.null_resource.bootstrap (remote-exec):   User: core
module.cluster.null_resource.bootstrap (remote-exec):   Password: false
module.cluster.null_resource.bootstrap (remote-exec):   Private key: false
module.cluster.null_resource.bootstrap (remote-exec):   Certificate: false
module.cluster.null_resource.bootstrap (remote-exec):   SSH Agent: true
module.cluster.null_resource.bootstrap (remote-exec):   Checking Host Key: false
module.cluster.null_resource.bootstrap (remote-exec): Connected!
module.cluster.null_resource.bootstrap (remote-exec): Job for bootstrap.service failed because the control process exited with error code.
module.cluster.null_resource.bootstrap (remote-exec): See "systemctl status bootstrap.service" and "journalctl -xe" for details.
core@node0 ~ $ cat /tmp/terraform_1985083120.sh
#!/bin/sh
sudo systemctl start bootstrap
core@node0 ~ $ journalctl -u bootstrap
-- Logs begin at Mon 2020-02-10 02:19:46 UTC. --
Feb 10 02:21:00 node0.cluster.example.com rkt[1442]: Downloading sha256:4062d80041b 107 MB / 107 MB
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:6b69eb11d04 408 B / 408 B
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:9c47fde751a 375 B / 375 B
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:8d523ca27b7 917 B / 917 B
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:be2693a52da 652 KB / 652 KB
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:0abeb150076 4.35 KB / 4.35 KB
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:23b6daf06fc 15.7 MB / 15.7 MB
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:7d7512f8b20 123 MB / 123 MB
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:346aee5ea5b 17.7 MB / 17.7 MB
Feb 10 02:21:01 node0.cluster.example.com rkt[1442]: Downloading sha256:4062d80041b 107 MB / 107 MB
Feb 10 02:21:34 node0.cluster.example.com rkt[1442]: Failed to stat /opt/bootstrap/assets: No such file or directory
Feb 10 02:21:34 node0.cluster.example.com systemd[1]: bootstrap.service: Main process exited, code=exited, status=1/FAILURE
Feb 10 02:21:34 node0.cluster.example.com systemd[1]: bootstrap.service: Failed with result 'exit-code'.
Feb 10 02:21:34 node0.cluster.example.com systemd[1]: Failed to start Kubernetes control plane.
core@node0 ~ $ systemctl cat bootstrap
# /etc/systemd/system/bootstrap.service
[Unit]
Description=Kubernetes control plane
ConditionPathExists=!/opt/bootstrap/bootstrap.done
[Service]
Type=oneshot
RemainAfterExit=true
WorkingDirectory=/opt/bootstrap
ExecStartPre=-/usr/bin/bash -c 'set -x && [ -n "$(ls /opt/bootstrap/assets/manifests-*/* 2>/dev/null)" ] && mv /opt/bootstrap/assets/manifests-*/* /opt/bootstrap/assets/manifests && rm -rf /opt/bootstrap/assets/manifests-*'
ExecStart=/usr/bin/rkt run \
    --trust-keys-from-https \
    --volume config,kind=host,source=/etc/kubernetes/bootstrap-secrets \
    --mount volume=config,target=/etc/kubernetes/secrets \
    --volume assets,kind=host,source=/opt/bootstrap/assets \
    --mount volume=assets,target=/assets \
    --volume script,kind=host,source=/opt/bootstrap/apply \
    --mount volume=script,target=/apply \
    --insecure-options=image \
    docker://k8s.gcr.io/hyperkube:v1.17.2 \
    --net=host \
    --dns=host \
    --exec=/apply
ExecStartPost=/bin/touch /opt/bootstrap/bootstrap.done
[Install]
WantedBy=multi-user.target
core@node0 ~ $ ls -l /opt/bootstrap/
total 8
-r-xr--r--. 1 root root 254 Feb 10 02:19 apply
-r-xr--r--. 1 root root 798 Feb 10 02:19 layout

Nothing appears to call the /opt/bootstrap/layout script, and when I run it manually, I get this error:

core@node0 ~ $ cat /opt/bootstrap/layout
#!/bin/bash -e
mkdir -p -- auth tls/etcd tls/k8s static-manifests manifests/coredns manifests-networking
awk '/#####/ {filename=$2; next} {print > filename}' assets
mkdir -p /etc/ssl/etcd/etcd
mkdir -p /etc/kubernetes/bootstrap-secrets
mv tls/etcd/{peer*,server*} /etc/ssl/etcd/etcd/
mv tls/etcd/etcd-client* /etc/kubernetes/bootstrap-secrets/
chown -R etcd:etcd /etc/ssl/etcd
chmod -R 500 /etc/ssl/etcd
mv auth/kubeconfig /etc/kubernetes/bootstrap-secrets/
mv tls/k8s/* /etc/kubernetes/bootstrap-secrets/
sudo mkdir -p /etc/kubernetes/manifests
sudo mv static-manifests/* /etc/kubernetes/manifests/
sudo mkdir -p /opt/bootstrap/assets
sudo mv manifests /opt/bootstrap/assets/manifests
sudo mv manifests-networking /opt/bootstrap/assets/manifests-networking
rm -rf assets auth static-manifests tls
core@node0 ~ $ sudo /opt/bootstrap/layout
awk: fatal: cannot open file `assets' for reading (No such file or directory)
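
For context on where that assets file is supposed to come from: the copy-controller-secrets step uploads the kubeconfig and a combined assets bundle to each controller and then runs the layout script to split the bundle apart. A rough, hypothetical sketch of that shape (paraphrasing the module's ssh.tf rather than quoting it; the host and placeholder contents are made up):

# Hypothetical sketch, not the actual module code
resource "null_resource" "copy-controller-secrets" {
  connection {
    type = "ssh"
    host = "node0.cluster.example.com" # placeholder controller
    user = "core"
  }

  provisioner "file" {
    content     = "...kubeconfig contents..." # placeholder
    destination = "$HOME/kubeconfig"
  }

  provisioner "file" {
    content     = "...combined assets bundle..." # placeholder
    destination = "$HOME/assets" # the path discussed in the comments below
  }

  provisioner "remote-exec" {
    inline = [
      "sudo /opt/bootstrap/layout", # assumed invocation; the exact commands may differ
    ]
  }
}

Since layout reads assets relative to the directory the remote command runs in, the awk error above indicates the bundle was never written where the script expects it.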

Desired Behavior

Assets should be written to controller nodes before attempting bootstrap

Steps to Reproduce

# Provision a kubernetes cluster on the bare-metal servers provided

locals {
  cluster_controller_count = 3
  cluster_macs = [
    // list of mac addresses for eth0 of nodes
  ]
  cluster_domain = "example.com"
}

module "cluster" {
  source = "git::https://github.com/poseidon/typhoon//bare-metal/container-linux/kubernetes?ref=v1.17.2"

  # bare-metal
  cluster_name            = local.cluster_name
  matchbox_http_endpoint  = var.matchbox_http_endpoint
  os_channel              = var.os_channel
  os_version              = var.os_version
  cached_install          = true

  # configuration
  k8s_domain_name    = "k8s.${local.cluster_name}.${local.cluster_domain}"
  ssh_authorized_key = var.ssh_authorized_key

  # machines
  controllers = [
    for index, x in slice(local.cluster_macs,0,local.cluster_controller_count) : {
      name      = "node${index}"
      mac       = x
      domain    = "node${index}.${local.cluster_domain}"
    }
  ]
  workers = [
    for index, x in slice(local.cluster_macs,local.cluster_controller_count,length(local.cluster_macs)) : {
      name      = "node${index + local.cluster_controller_count}"
      mac       = x
      domain    = "node${index + local.cluster_controller_count}.${local.cluster_domain}"
    }
  ]
}
waddles commented 4 years ago

The cause is the use of $HOME in the remote file provisioner at https://github.com/poseidon/typhoon/blob/v1.17.2/bare-metal/container-linux/kubernetes/ssh.tf#L37

A similar problem exists on the workers due to https://github.com/poseidon/typhoon/blob/v1.17.2/bare-metal/container-linux/kubernetes/ssh.tf#L69

I removed the $HOME/ prefix from all three provisioners (the kubeconfig and assets of copy-controller-secrets, as well as the kubeconfig of copy-worker-secrets). The files now copy over correctly and the scripts execute.

A further problem I found is that the /opt/bootstrap/layout script is not idempotent and fails on second execution. I recommend adding a -x flag to the shebang, which has the added benefit of tracing where it fails in terraform debug logs.
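
Concretely, the first change amounts to something like this in each of the three file provisioner blocks (a hypothetical sketch, not the actual diff):

provisioner "file" {
  content     = "...combined assets bundle..." # placeholder content
  destination = "assets"                       # was "$HOME/assets"
}

The second change amounts to making the first line of /opt/bootstrap/layout read #!/bin/bash -ex, so every command is echoed and shows up in the Terraform debug output.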

dghubble commented 4 years ago

Modules (including bare-metal) are used in production without issue and I've not reproduced this. So you'll want to focus on how your environment may differ.

Likely, something went wrong in your initial copy-controller-secrets step (check the Terraform logs) for Terraform to believe the operation completed successfully and continue. I'd recommend using Terraform's debug options to inspect what occurs during that step and why $HOME expansion isn't working for you.

Some ideas to consider:

waddles commented 4 years ago

$HOME in the destination path is superfluous anyway, since files given relative paths are written to the home directory. It is correctly expanded in the remote-exec section, since the remote bash shell does the expansion there, not ssh.

I ran all manner of Terraform debugging with TF_LOG=trace and I would see it connect and run the systemctl start bootstrap snippet but fail and leave behind the Terraform temporary script file, so I knew remote execution over ssh was working. The difference between null_resource.bootstrap and null_resource.copy-controller-secrets is that the latter attempts to copy files over first. Terraform never gave any indication of it failing to "create" those resources, even under trace level debugging. It was only after adding the -x flag to /opt/bootstrap/layout that I saw anything sensible from Terraform.

Once I narrowed it down to the file provisioner of copy-controller-secrets, I tried using full path, no path and tilde expansion, but the latter also failed.

I am running Terraform on macOS 10.15.2

Maybe this is peculiar to Mac, but removing $HOME and adding -x shouldn't hurt anything and will make debugging a lot easier. I would create a PR, but I don't have a way of testing the other platforms that probably also suffer from this issue.

dghubble commented 4 years ago

I cannot reproduce this on a fresh macOS Catalina 10.15.3 setup with Terraform v0.12.20 (same plugins as report), provisioning a bare-metal v1.17.2 cluster with Flatcar Linux 2303.4.0 (stable). That does align with the existing user base (incl. macOS folks) not having trouble.

Since your issue appears in the copy-controller-secrets step, you can have Terraform re-run just that step to investigate (so you don't need to keep creating clusters):

export TF_LOG=debug
terraform taint "module.mycluster.null_resource.copy-controller-secrets[0]"
terraform apply
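
With debug logging enabled, the useful signal is the file-provisioner output for that step, in particular the "Starting remote scp process" line (visible in the successful log excerpt further down), which shows the destination path handed to the remote side.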

Alternately, remove Typhoon from the equation entirely. If you can create a more minimal example, you may be able to determine what's going on or report it upstream.

resource "null_resource" "minimal" {
  connection {
    type    = "ssh"
    host    = "any-random-linux-box"
    user    = "core"
    timeout = "15m"
  }

  provisioner "file" {
    content     = "bar"
    destination = "$HOME/foo"
  }
}
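
Applying a minimal resource like that against any reachable host, then checking whether the content ends up in /home/core/foo, in a file literally named $HOME/foo, or nowhere at all, narrows down whether the provisioner or the remote side is mishandling the path.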

You might also try applying from a different environment/setup. Also, in case it's helpful, here are logs from a successful file provision step (run in the initial apply; it placed the kubeconfig content in $HOME/kubeconfig).

module.mercury.null_resource.copy-worker-secrets[1]: Provisioning with 'file'...
2020-02-11T21:44:13.681-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Connecting to REDACTED:22 for SSH
2020-02-11T21:44:13.691-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Connection established. Handshaking for user core
2020-02-11T21:44:13.779-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Telling SSH config to forward to agent
2020-02-11T21:44:13.779-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Setting up a session to request agent forwarding
2020-02-11T21:44:13.927-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [INFO] agent forwarding enabled
2020-02-11T21:44:13.927-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] starting ssh KeepAlives
2020-02-11T21:44:13.928-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] opening new ssh session
2020-02-11T21:44:13.933-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Starting remote scp process:  scp -vt $HOME
2020-02-11T21:44:13.937-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Started SCP session, beginning transfers...
2020-02-11T21:44:13.938-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Copying input data into temporary file so we can read the length
2020-02-11T21:44:13.949-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Beginning file upload...
2020-02-11T21:44:13.963-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] SCP session complete, closing stdin pipe.
2020-02-11T21:44:13.963-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Waiting for SSH session to complete.
2020-02-11T21:44:13.972-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [ERROR] scp stderr: "Sink: C0644 5643 kubeconfig\n"
module.mercury.copy-worker-secrets[1]: Provisioning with 'remote-exec'...
...

Please hold onto any change proposals for a bit.

waddles commented 4 years ago

Well I am also unable to reproduce it with a clean state and freshly installed nodes.

The first few times I tried to Terraform the cluster, I always had a few nodes that didn't PXE boot cleanly, and I was using Active Directory DHCP on the second boot, which would mess up the DNS records I had created manually. I've since modified my project to use Cloudflare DNS and static IPs (by supplying an Ignition YAML snippet for each node), so AD is completely removed from the equation now. When the nodes didn't come up properly, I would have to cancel the Terraform operation to leave work for the day, and that possibly confused the state.

So my method is now to target the resources needed to successfully boot and contact all nodes first, then come back and finish terraforming the cluster:

  1. Terraform DNS records
    terraform apply \
    -target module.$CLUSTER_NAME.cloudflare_record.api_a_records \
    -target module.$CLUSTER_NAME.cloudflare_record.node_a_records
  2. Terraform Matchbox resources
    terraform apply \
    -target module.$CLUSTER_NAME.module.cluster.matchbox_profile.cached-flatcar-linux-install \
    -target module.$CLUSTER_NAME.module.cluster.matchbox_group.install \
    -target module.$CLUSTER_NAME.module.cluster.matchbox_group.controller \
    -target module.$CLUSTER_NAME.module.cluster.matchbox_group.worker
  3. PXE boot all nodes and verify that they came up properly
    for i in $(seq 0 $NUM_NODES); do
      ssh core@node${i}.$CLUSTER_NAME.example.com uptime
    done
  4. Terraform everything else
    terraform apply
dghubble commented 4 years ago

Glad to hear you've got the situation sorted out. Nodes do require stable names, and you'd typically have a network router statically assign IPs (still via DHCP) and have DNS records correspond, though I assume you've got reasons for preferring an Ignition snippet to set the static IP.

Terraform knows the dependency graph among resources, so it shouldn't be required to apply individual targets. For example, null_resource steps will wait up to 60 minutes, since I know it can take a while to get all the machines to behave sometimes. Or if they timeout, the next terraform apply will happily retry later, after machines are fixed (say one is faulty and can't boot). But I think you're saying your goal is to do it as two phases of terraform apply (not counting DNS records since they're a prereq), in which case this seems like a decent approach.