Open rocking-e opened 3 years ago
There is a workaround for this issue. It involves two changes. The first is to add an SSM parameter resource for the CA pin hash to the ssm.tf file in the Terraform directory, giving it an initial value that can't be mistaken for a valid sha256 hash:
name = "/teleport/${var.cluster_name}/ca-pin-hash"
type = "String"
value = "0xdeadbeef"
overwrite = true
tags = local.required_tags
lifecycle {
ignore_changes = [value, type]
}
}
Because the parameter is now managed by Terraform, it will be deleted any time the cluster is "destroyed" via terraform destroy.
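Until that change is in place, the leftover parameter can also be removed by hand after a destroy; a minimal sketch using the AWS CLI (the cluster name is an example):

```bash
# Delete the stale CA pin parameter so the next apply starts clean.
aws ssm delete-parameter --name "/teleport/example-cluster/ca-pin-hash"
```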
The other change needs to be made in the files proxy-user-data.tpl and node-user-data.tpl in the Terraform directory. Add the following line to the end of each file:
echo -e '\n\ngrep sha256 /var/lib/teleport/ca-pin-hash\n' >>/usr/local/bin/teleport-ssm-get-token
That line appends a grep to the end of the teleport-ssm-get-token script, which will cause the script to fail if the value read from SSM doesn't begin with sha256, which it won't if the parameter was created by the Terraform plan but not yet updated by the auth node. Because the grep is the last command in the script, its exit status becomes the script's exit status, and the failure will then cause systemd to try a restart.
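For reference, a sketch of what the tail of the patched script looks like (this assumes, as the workaround does, that the fetched pin has already been written to /var/lib/teleport/ca-pin-hash):

```bash
# Appended by the user-data workaround: grep exits non-zero unless the
# cached value contains "sha256", so the script as a whole fails and
# systemd retries the unit until the auth node has published a real pin.
grep sha256 /var/lib/teleport/ca-pin-hash
```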
We actually use a placeholder string for the CA pin hash in the config generation script (https://github.com/gravitational/teleport/blob/master/assets/aws/files/bin/teleport-generate-config#L269) - I can modify the Terraform to set this same value in SSM initially, so that it will always be managed/destroyed by Terraform.
I'll also look at a PR to modify the AMI so that the teleport-ssm-get-token script exits with an error if the value from SSM doesn't begin with sha256:, which should solve the problem in the same way as you've described.
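A minimal sketch of what that check could look like inside the script (the parameter path, environment variable, and messages are assumptions for illustration, not the actual AMI code):

```bash
#!/bin/bash
set -euo pipefail

# Fetch the CA pin published to SSM by the auth server.
CA_PIN="$(aws ssm get-parameter \
  --name "/teleport/${TELEPORT_CLUSTER_NAME}/ca-pin-hash" \
  --query 'Parameter.Value' --output text)"

# Reject placeholder or stale values: a real CA pin always starts with "sha256:".
if [[ "${CA_PIN}" != sha256:* ]]; then
  echo "CA pin not yet published by the auth server: ${CA_PIN}" >&2
  exit 1
fi

echo "${CA_PIN}"
```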
Description
When executing the sequence terraform apply, followed by terraform destroy, and then another terraform apply, the auth nodes will initialize properly but the proxy and nodes will not. The error on the proxy/nodes will show a TLS error, with the log message saying this might happen if the certificates on the auth server don't match the ca-pin-hash. This occurs because the value in the SSM parameter for ca-pin-hash is not correct. The value in that parameter is set by one of the auth nodes when it starts up. On the first "apply" it is written properly and the proxy/node servers get the correct value. However, the "destroy" does not remove that SSM parameter, so it is still there when the proxy/node processes start and read it. But that value is from the previous run; the auth servers haven't reinitialized it yet.
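The stale value can be confirmed directly in SSM; a sketch using the AWS CLI (the cluster name is an example):

```bash
# Show the CA pin currently stored in SSM; after a destroy this still
# holds the hash from the previous cluster.
aws ssm get-parameter --name "/teleport/example-cluster/ca-pin-hash" \
  --query 'Parameter.Value' --output text
```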
A systemd restart is required on the proxy and node servers to get them initialized properly, because as far as systemd is concerned the teleport process started successfully. I'm guessing that on the very first "apply" the proxy/node systemd startup services fail because the SSM parameter doesn't exist, so systemd tries a restart, and after one or two attempts it succeeds.

Reproduction Steps
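A minimal command sequence that reproduces the failure (sketch; run from the Terraform directory):

```bash
terraform apply    # first deploy: an auth node publishes a valid ca-pin-hash to SSM
terraform destroy  # tears down the cluster but leaves the SSM parameter behind
terraform apply    # proxy/nodes read the stale pin before auth rewrites it
```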
After the second "apply" the auth nodes will show as "up" in the output of tctl status. If you look on a proxy/node server, the value in the /var/lib/teleport/ca-pin-hash file will not match the CA pin reported by tctl status on the auth node; the sketch below shows the comparison.
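```bash
# On an auth node: tctl status prints the cluster's current CA pin.
tctl status

# On a proxy/node instance: show the cached pin fetched from SSM.
cat /var/lib/teleport/ca-pin-hash
```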
Server Details
Debug Logs
Some logs are available in the ticket https://gravitational.zendesk.com/hc/en-us/requests/3275