The cause is using $HOME in the remote file provisioner of https://github.com/poseidon/typhoon/blob/v1.17.2/bare-metal/container-linux/kubernetes/ssh.tf#L37. A similar problem exists on the workers due to https://github.com/poseidon/typhoon/blob/v1.17.2/bare-metal/container-linux/kubernetes/ssh.tf#L69.
I removed the $HOME/ from all 3 provisioners (kubeconfig and assets of copy-controller-secrets, as well as the kubeconfig of copy-worker-secrets). It now correctly copies the files over and executes the scripts.
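For illustration, the change amounts to dropping the prefix from each destination, roughly like this (a sketch, not the exact ssh.tf contents; local.kubeconfig is a placeholder):
provisioner "file" {
  content     = local.kubeconfig  # placeholder for whatever ssh.tf actually passes
  destination = "kubeconfig"      # was "$HOME/kubeconfig"
}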
A further problem I found is that the /opt/bootstrap/layout script is not idempotent and fails on second execution. I recommend adding a -x flag to the shebang, which has the added benefit of tracing where it fails in Terraform debug logs.
Modules (including bare-metal) are used in production without issue and I've not reproduced this. So you'll want to focus on how your environment may differ.
asset_dir defaults to "" (recommended) and (as you've noticed) is unrelated. The copy-controller-secrets step(s) distribute a minimal assets bundle to controllers, expanding $HOME on the remote, and unpack/lay out controller assets/credentials. The layout operation is not idempotent - its goal is to place credentials/permissions correctly and delete the bundle (rather than leaving credentials lying around). Likely, something went wrong in your initial copy-controller-secrets run for Terraform to believe the operation completed successfully and continue (check the Terraform logs). I'd recommend using Terraform debug options to inspect what occurs during that step and why $HOME expansion isn't working for you.
Some ideas to consider: $HOME in the destination path is superfluous anyway, since that's where ssh writes files given relative paths. It is correctly expanded in the remote-exec section since the remote bash shell is doing the expansion, not ssh.
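As a sketch of the distinction (resource and host names here are illustrative, not Typhoon's actual config):
resource "null_resource" "example" {
  connection {
    type = "ssh"
    host = "controller.example.com"
    user = "core"
  }

  # scp receives the destination verbatim; a relative path already lands
  # in the login user's home directory, so $HOME adds nothing here.
  provisioner "file" {
    content     = "example"
    destination = "kubeconfig"
  }

  # remote-exec runs through a shell on the remote host, which expands
  # $HOME as usual.
  provisioner "remote-exec" {
    inline = ["sudo mv $HOME/kubeconfig /etc/kubernetes/kubeconfig"]
  }
}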
I ran all manner of Terraform debugging with TF_LOG=trace and I would see it connect and run the systemctl start bootstrap snippet, but fail and leave behind the Terraform temporary script file, so I knew remote execution over ssh was working. The difference between null_resource.bootstrap and null_resource.copy-controller-secrets is that the latter attempts to copy files over first. Terraform never gave any indication of failing to "create" those resources, even under trace-level debugging. It was only after adding the -x flag to /opt/bootstrap/layout that I saw anything sensible from Terraform.
Once I narrowed it down to the file provisioner of copy-controller-secrets, I tried using a full path, no path, and tilde expansion, but the latter also failed.
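Concretely, the variants I tried looked roughly like this (paths illustrative):
provisioner "file" {
  content = "..."
  # destination = "/home/core/kubeconfig"  # full path
  # destination = "kubeconfig"             # no path
  destination = "~/kubeconfig"             # tilde expansion (also failed)
}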
I am running Terraform on macOS 10.15.2. Maybe this is peculiar to Mac, but removing $HOME and adding -x shouldn't hurt anything and will make debugging a lot easier. I would create a PR, but I don't have a way of testing the other platforms that probably also suffer from this issue.
I cannot reproduce this on a fresh macOS Catalina 10.15.3 setup with Terraform v0.12.20 (same plugins as report), provisioning a bare-metal v1.17.2 cluster with Flatcar Linux 2303.4.0 (stable). That does align with the existing user base (incl. macOS folks) not having trouble.
Since your issue appears in the copy-controller-secrets step, you can have Terraform re-run that step to investigate (so you don't need to keep making clusters):
export TF_LOG=debug
terraform taint "module.mycluster.null_resource.copy-controller-secrets[0]"
terraform apply
Alternately, remove Typhoon from the equation entirely. If you can create a more minimal example, you may be able to determine what's going on or report it upstream:
resource "null_resource" "minimal" {
connection {
type = "ssh"
host = "any-random-linux-box"
user = "core"
timeout = "15m"
}
provisioner "file" {
content = "bar"
destination = "$HOME/foo"
}
}
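Toggling destination between "$HOME/foo" and plain "foo" in a standalone resource like this should show whether the file provisioner's $HOME handling is at fault, independent of Typhoon.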
You might also try applying from a different environment/setup. Also, in case it's helpful, here are logs from a successful file provision step (run during the initial apply, placing kubeconfig content in $HOME/kubeconfig):
module.mercury.null_resource.copy-worker-secrets[1]: Provisioning with 'file'...
2020-02-11T21:44:13.681-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Connecting to REDACTED:22 for SSH
2020-02-11T21:44:13.691-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Connection established. Handshaking for user core
2020-02-11T21:44:13.779-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Telling SSH config to forward to agent
2020-02-11T21:44:13.779-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Setting up a session to request agent forwarding
2020-02-11T21:44:13.927-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [INFO] agent forwarding enabled
2020-02-11T21:44:13.927-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] starting ssh KeepAlives
2020-02-11T21:44:13.928-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] opening new ssh session
2020-02-11T21:44:13.933-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Starting remote scp process: scp -vt $HOME
2020-02-11T21:44:13.937-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Started SCP session, beginning transfers...
2020-02-11T21:44:13.938-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Copying input data into temporary file so we can read the length
2020-02-11T21:44:13.949-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Beginning file upload...
2020-02-11T21:44:13.963-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] SCP session complete, closing stdin pipe.
2020-02-11T21:44:13.963-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [DEBUG] Waiting for SSH session to complete.
2020-02-11T21:44:13.972-0800 [DEBUG] plugin.terraform: file-provisioner (internal) 2020/02/11 21:44:13 [ERROR] scp stderr: "Sink: C0644 5643 kubeconfig\n"
module.mercury.copy-worker-secrets[1]: Provisioning with 'remote-exec'...
...
Please hold onto any change proposals for a bit.
Well, I am also unable to reproduce it with a clean state and freshly installed nodes.
The first few times I tried to Terraform the cluster, I always had a few nodes that didn't PXE boot cleanly, and I was using Active Directory DHCP on the second boot, which would mess up the DNS records I had created manually. I've since modified my project to use Cloudflare DNS and static IPs, supplying an Ignition YAML snippet for each node, so AD is completely removed from the equation now. When the nodes didn't come up properly, I would have to cancel the Terraform operation to leave work for the day, and that possibly confused the state.
So my method is now to target the resources needed to successfully boot and contact all nodes first, then come back and finish terraforming the cluster:
terraform apply \
-target module.$CLUSTER_NAME.cloudflare_record.api_a_records \
-target module.$CLUSTER_NAME.cloudflare_record.node_a_records
terraform apply \
-target module.$CLUSTER_NAME.module.cluster.matchbox_profile.cached-flatcar-linux-install \
-target module.$CLUSTER_NAME.module.cluster.matchbox_group.install \
-target module.$CLUSTER_NAME.module.cluster.matchbox_group.controller \
-target module.$CLUSTER_NAME.module.cluster.matchbox_group.worker
for i in $(seq 0 "$NUM_NODES"); do
ssh core@node${i}.$CLUSTER_NAME.example.com uptime
done
terraform apply
Glad to hear you've got the situation sorted out. Nodes do require stable names and you'd typically have a network router statically assign IPs (still via DHCP) and have DNS records correspond. Although I assume you've got reasons for preferring an Ignition snippet to set the static IP too.
Terraform knows the dependency graph among resources, so it shouldn't be required to apply individual targets. For example, the null_resource steps will wait up to 60 minutes, since I know it can sometimes take a while to get all the machines to behave. And if they time out, the next terraform apply will happily retry later, after machines are fixed (say one is faulty and can't boot). But I think you're saying your goal is to do it as two phases of terraform apply (not counting DNS records, since they're a prerequisite), in which case this seems like a decent approach.
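Sketched in HCL (illustrative values, not Typhoon's exact code), the provisioner connections carry long timeouts so an apply simply waits for slow machines:
resource "null_resource" "copy-controller-secrets" {
  connection {
    type    = "ssh"
    host    = "controller0.example.com"  # illustrative
    user    = "core"
    timeout = "60m"  # apply blocks here while a machine comes up; if it
                     # times out, the resource is tainted and the next
                     # apply retries it
  }
  # file/remote-exec provisioners elided
}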
Bug
Environment
Terraform v0.12.20
Problem
Typhoon fails to bootstrap the Kubernetes cluster. Nothing appears to call the /opt/bootstrap/layout script, and when run manually, I get this error:
Desired Behavior
Assets should be written to controller nodes before attempting bootstrap
Steps to Reproduce