gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Using terraform code for aws ha-cluster with graviton ami does not start cluster #23554

Closed filipvh-sentia closed 3 months ago

filipvh-sentia commented 1 year ago

Expected behavior:

Running the terraform code with sensible values using letsencrypt should start the cluster and make the webUI available.

Current behavior:

Bug details:

Teleport version

I used the gravitational-teleport-ami-oss-12.1.0 AMI.

Recreation steps

I used the following terraform configuration:

  ami_name                    = "gravitational-teleport-ami-oss-12.1.0"
  cluster_name                = "teleport"
  email                       = "my-email"
  grafana_pass                = "some-password"
  key_name                    = "my-key"
  region                      = var.region
  route53_domain              = "teleport.example.com"
  route53_zone                = "example.com"
  s3_bucket_name              = "my-bucket"
  use_acm                     = false
  enable_mongodb_listener     = false
  enable_mysql_listener       = false
  enable_postgres_listener    = false
  vpc_cidr                    = "172.0.0.0/16"

Debug logs

I asked on the Slack channel here: https://goteleport.slack.com/archives/CEZH6UL64/p1679649769238839

After creating the cluster I noticed the following issues:

When I logged into the proxy node I found the proxy service was returning the following error:

An error occurred (ParameterNotFound) when calling the GetParameter operation:

This parameter should be set by the auth server (I found this by going through the bin files). On the auth server, I found that none of the services that were supposed to run had actually run.

It turns out that on my auth servers the teleport-lock script returned an error. When the publish-tokens service ran, the teleport-lock script's request to the IMDS service got a 401 (Unauthorized) response.

# teleport-ssm-publish-tokens.service

[root@ip-172-0-1-202 bin]# systemctl status teleport-ssm-publish-tokens.service
● teleport-ssm-publish-tokens.service - Service rotating teleport tokens
   Loaded: loaded (/etc/systemd/system/teleport-ssm-publish-tokens.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2023-03-24 09:06:37 UTC; 4min 37s ago
  Process: 1125 ExecStartPre=/usr/local/bin/teleport-lock (code=exited, status=255)

Mar 24 09:06:36 ip-172-0-1-202.eu-west-1.compute.internal teleport-lock[1125]: <title>401 - Unauthorized</title>
Mar 24 09:06:36 ip-172-0-1-202.eu-west-1.compute.internal teleport-lock[1125]: </head>
Mar 24 09:06:36 ip-172-0-1-202.eu-west-1.compute.internal teleport-lock[1125]: <body>
Mar 24 09:06:36 ip-172-0-1-202.eu-west-1.compute.internal teleport-lock[1125]: <h1>401 - Unauthorized</h1>
Mar 24 09:06:36 ip-172-0-1-202.eu-west-1.compute.internal teleport-lock[1125]: </body>
Mar 24 09:06:36 ip-172-0-1-202.eu-west-1.compute.internal teleport-lock[1125]: </html>"}}
Mar 24 09:06:37 ip-172-0-1-202.eu-west-1.compute.internal systemd[1]: teleport-ssm-publish-tokens.service: control process exited, code=exited status=255
Mar 24 09:06:37 ip-172-0-1-202.eu-west-1.compute.internal systemd[1]: Failed to start Service rotating teleport tokens.
Mar 24 09:06:37 ip-172-0-1-202.eu-west-1.compute.internal systemd[1]: Unit teleport-ssm-publish-tokens.service entered failed state.
Mar 24 09:06:37 ip-172-0-1-202.eu-west-1.compute.internal systemd[1]: teleport-ssm-publish-tokens.service failed.

Running the specific part of the script that queries the local-hostname returns the same error that systemctl status showed:

IMDS_TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
IMDS_TOKEN_HEADER="-H \"X-aws-ec2-metadata-token: ${IMDS_TOKEN}\""
NOW=$(date +%s)
TTL=$((NOW+3660))
PROCESS=$(curl -sS "${IMDS_TOKEN_HEADER}" http://169.254.169.254/latest/meta-data/local-hostname)

echo $PROCESS
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>401 - Unauthorized</title>
  </head>
  <body>
    <h1>401 - Unauthorized</h1>
  </body>
</html>

The example code from the AWS documentation did work. I verified that the token obtained in your code looked correct and also worked, just not in the curl command used for PROCESS. When I replaced PROCESS=$(curl -sS "${IMDS_TOKEN_HEADER}" http://169.254.169.254/latest/meta-data/local-hostname) with PROCESS=$(curl -sS -H "X-aws-ec2-metadata-token: ${IMDS_TOKEN}" http://169.254.169.254/latest/meta-data/local-hostname), the locking worked.
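A likely explanation for the difference is shell quoting, though this is an inference and not something confirmed in the thread. Expanding "${IMDS_TOKEN_HEADER}" inside double quotes hands curl one single argument that still contains the literal quote characters, whereas the replacement passes -H and the header as two separate arguments. The sketch below makes the difference visible without touching the network. The show_args helper and the token value are hypothetical and exist only for illustration.

```shell
# Hypothetical token value, used only for illustration (not a real IMDS token).
IMDS_TOKEN="example-token"

# Hypothetical helper: prints the argument count, then each argument in brackets.
show_args() {
  printf '%d' "$#"
  printf ' [%s]' "$@"
  printf '\n'
}

# Pattern from teleport-lock: flag and header stored in one string.
# The escaped quotes become literal characters inside the variable.
IMDS_TOKEN_HEADER="-H \"X-aws-ec2-metadata-token: ${IMDS_TOKEN}\""
broken=$(show_args "${IMDS_TOKEN_HEADER}")

# Replacement pattern: flag and header passed as two separate arguments.
fixed=$(show_args -H "X-aws-ec2-metadata-token: ${IMDS_TOKEN}")

echo "$broken"   # 1 [-H "X-aws-ec2-metadata-token: example-token"]
echo "$fixed"    # 2 [-H] [X-aws-ec2-metadata-token: example-token]
```

In the first form the command receives one argument with embedded quotes, so the header curl sends presumably no longer matches the name IMDS expects, which would be consistent with the 401 response.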

After I then ran the teleport-get-certificate and teleport-ssm-publish-tokens services, my web UI came up. On AWS, all the health checks also started to succeed (with the exception of mongodb, postgres and mysql, as those are disabled in my config).

Request

Can anybody verify my findings? I'll gladly make a PR with my fix. Locally I've also made some changes to use launch templates rather than launch configurations, and added a way to forward tags to the EC2 resources created by the ASG and to the volumes created by the launch template. If you're interested, I'll gladly push those back in (separate) PRs as well. In the launch template I've also fixed the repeated creation of new versions when no changes have been made, which was caused by the metadata_options { http_tokens = "required" } block.

zmb3 commented 1 year ago

Our AMIs don't support ARM platforms today, so they won't work on Graviton instances.

filipvh-sentia commented 1 year ago

Hi Zmb3,

This was on non-ARM instances (e.g. t3.large), so I'm not sure the ARM label is relevant.

tcsc commented 1 year ago

Definitely experienced this on amd64.

zmb3 commented 3 months ago

@hugoShaka do you know if this is still an issue?

hugoShaka commented 3 months ago

This was fixed in https://github.com/gravitational/teleport/pull/25295