gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com

Possible race condition in cluster initialization when using terraform in examples/aws/terraform/ha-autoscale-cluster/ #8467

Open rocking-e opened 3 years ago

rocking-e commented 3 years ago

Description

When executing the sequence terraform apply, then terraform destroy, then another terraform apply, the auth nodes will initialize properly but the proxy and node servers will not. The proxy/node servers log a TLS error with a message saying this can happen if the certificates on the auth server don't match the ca-pin-hash.

This occurs because the value in the SSM parameter for ca-pin-hash is not correct. The value in that parameter is set by one of the auth nodes when it starts up. On the first "apply" it is written properly and the proxy/node servers read the correct value. However, the "destroy" does not remove that SSM parameter, so it is still present when the proxy/node processes start on the second "apply". They read it, but that value is from the previous run; the auth servers haven't re-initialized it yet.
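For reference, the leftover parameter can be inspected (or deleted by hand) with the AWS CLI; the path matches the one used in the Terraform further down, with an example cluster name substituted for var.cluster_name:

# Show the stale ca-pin-hash value left over from the previous cluster
aws ssm get-parameter --name "/teleport/example-cluster/ca-pin-hash" --query 'Parameter.Value' --output text

# Deleting it before the second apply avoids the stale read
aws ssm delete-parameter --name "/teleport/example-cluster/ca-pin-hash"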

A systemd restart is required on the proxy and node servers to get them initialized properly, because as far as systemd is concerned the teleport process started successfully. I'm guessing that on the very first "apply" the proxy/node startup services fail because the SSM parameter doesn't exist yet, so systemd retries, and after one or two attempts the start succeeds.
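The manual fix is just a restart once the auth nodes have published the real pin (the unit name here is an assumption; use whatever the AMI actually installs):

# Restart Teleport on each affected proxy/node so it re-reads the SSM parameter
sudo systemctl restart teleport
sudo systemctl status teleport --no-pager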

Reproduction Steps

  1. terraform apply
  2. terraform destroy
  3. terraform apply

After the second "apply" the auth nodes will show as "up" in the output of tctl status. If you look at the /var/lib/teleport/ca-pin-hash file on the proxy/node servers, the value there will not match the output of tctl status on the auth node.
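A rough way to confirm the mismatch, assuming tctl status on the auth node prints the CA pin on its own line:

# On an auth node: show the published CA pin
tctl status | grep sha256

# On a proxy/node instance: show the pin it actually read from SSM
cat /var/lib/teleport/ca-pin-hash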

Server Details

Debug Logs

Some logs are available in the ticket https://gravitational.zendesk.com/hc/en-us/requests/3275

rocking-e commented 3 years ago

There is a workaround for the issue. It involves 2 changes. The first is to add an SSM parameter resource to the ssm.tf file in the Terraform configuration, giving it an initial value that can't be mistaken for a valid sha256 hash:

# Resource label "ca_pin_hash" is just an example name
resource "aws_ssm_parameter" "ca_pin_hash" {
  name      = "/teleport/${var.cluster_name}/ca-pin-hash"
  type      = "String"
  value     = "0xdeadbeef"
  overwrite = true
  tags      = local.required_tags

  lifecycle {
    ignore_changes = [value, type]
  }
}

This will cause the SSM parameter to be deleted anytime the cluster is "destroyed" via terraform destroy.

The other change needs to be made in the files proxy-user-data.tpl and node-user-data.tpl in the terraform directory. Add the following line to the end of each file:

echo -e '\n\ngrep sha256 /var/lib/teleport/ca-pin-hash\n' >>/usr/local/bin/teleport-ssm-get-token

That line will cause the teleport-ssm-get-token script to fail if the value read from SSM doesn't begin with sha256, which it won't if the parameter exists from the Terraform plan but hasn't yet been updated by the auth node. The failure will then cause systemd to try a restart.
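To illustrate the effect, assuming the Terraform placeholder above is what's in the file before an auth node overwrites it:

# File still contains the placeholder (e.g. "0xdeadbeef"), so the check fails:
grep sha256 /var/lib/teleport/ca-pin-hash
# exit status 1 -> teleport-ssm-get-token fails -> systemd retries the unit

# Once the auth node writes the real "sha256:<hash>" value, the same check
# succeeds and startup proceeds normally.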

webvictim commented 3 years ago

We actually use a placeholder string for the CA pin hash in the config generation script (https://github.com/gravitational/teleport/blob/master/assets/aws/files/bin/teleport-generate-config#L269) - I can modify the Terraform to set this same value in SSM initially, so that it will always be managed/destroyed by Terraform.

I'll also look at a PR to modify the AMI so that the teleport-ssm-get-token script exits with an error if the value from SSM doesn't begin with sha256:, which should solve the problem in the same way as you've described.
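Something along these lines would do it (a sketch only, not the actual script change; CA_PIN_HASH is a stand-in for whatever variable the script uses for the value fetched from SSM):

# Sketch of the proposed guard: bail out (so systemd retries) if the value
# fetched from SSM doesn't start with "sha256:"
if [[ "${CA_PIN_HASH}" != sha256:* ]]; then
  echo "CA pin hash not yet written by the auth server, retrying" >&2
  exit 1
fi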