hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: RDS Proxy provisioned through terraform drops connections due to unknown error #34747

Closed: westrachel closed this issue 10 months ago

westrachel commented 10 months ago

Terraform Core Version

1.6.4

AWS Provider Version

3.6.0, 5.25.0

Affected Resource(s)

aws_db_proxy, aws_db_proxy_default_target_group, aws_db_proxy_target

Expected Behavior

When I invoke a Lambda function that connects to an RDS Aurora PostgreSQL cluster provisioned through Terraform, I expect the function to maintain its connection to the database without errors so that it can execute SQL statements successfully.

Actual Behavior

When I invoke a Lambda function that connects to an RDS Aurora PostgreSQL cluster provisioned through Terraform, the database connection is dropped when the function tries to execute SQL statements, preventing it from doing meaningful work. The error is vague, but based on the steps I have taken to debug this, I believe the problem lies with the RDS proxy.

Relevant Error/Panic Output Snippet

The CloudWatch logs of the Lambda function that's using the proxy to connect to the database show the following error:

Unknown error. SSL connection has been closed unexpectedly

The proxy's CloudWatch logs show the following messages:

A TCP connection was established from the proxy at <IP>:<PORT> to the database at <IP>:5432.
The database connection closed. Reason: An internal error occurred.

Terraform Configuration Files

provider.tf file contents:
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

variables.tf file contents:
locals {
  dot_env_file_path = ".env"
  dot_env_regex     = "(?m:^\\s*([^#\\s]\\S*)\\s*=\\s*[\"']?(.*[^\"'\\s])[\"']?\\s*$)"
  dot_env           = { for tuple in regexall(local.dot_env_regex, file(local.dot_env_file_path)) : tuple[0] => sensitive(tuple[1]) }
  account_id        = local.dot_env["ACCOUNT_ID"]
  provider_name     = local.dot_env["PROVIDER_AWS_ROLE"]
  aws_key_id        = local.dot_env["AWS_KEY_ID"]
  aws_key_value     = local.dot_env["AWS_KEY_VALUE"]
  db_port           = local.dot_env["PORT"]
}

variable "region" {
  default = "string value of AWS region, replace this with an AWS region that's relevant to you"
}

variable "db_name" {
  default = "string value of the database name, replace this with whatever you want"
}

variable "db_username" {
  default = "string value of db username - replace this value with whatever you want"
}

variable "db_instance_type" {
  default = "replace with the string value of the AWS db instance type you want to use"
}

The .env file needs to contain the following variables. Please replace the values with ones appropriate for your AWS account; the port is the PostgreSQL port, 5432. Also note that the role I temporarily created for the Terraform provider to use is overly permissive: it has an underlying policy that allows all actions across all resources.

ACCOUNT_ID=
PROVIDER_AWS_ROLE=
AWS_KEY_ID=
AWS_KEY_VALUE=
PORT=
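
(For completeness: these locals would presumably feed an aws provider block roughly like the following. I'm not showing my real provider block, so treat this shape, in particular the assume_role wiring, as an illustration only.)

provider "aws" {
  region     = var.region
  access_key = local.aws_key_id    # from AWS_KEY_ID in .env
  secret_key = local.aws_key_value # from AWS_KEY_VALUE in .env

  assume_role {
    # illustration only: assumes PROVIDER_AWS_ROLE holds the role name
    role_arn = "arn:aws:iam::${local.account_id}:role/${local.provider_name}"
  }
}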

Steps to Reproduce

(1) terraform plan and terraform apply all of the following resources in steps. Note that the code below references variables (var and local) whose values need to be configured; I show dummy versions of the files that hold these values under the Terraform Configuration Files section. I say to apply the changes in steps because they cannot all be applied at once: for example, the aws_iam_role_policy_attachment resources cannot be applied before the underlying policies they attach are created, because their ARNs aren't available yet (see the sketch of a staged apply below). I'm not currently showing the code for the rotating-secret Lambda or the create-table Lambda, but let me know if that's needed and I can provide a sample. The create-table Lambda is the one that connects to the RDS proxy to execute SQL statements.
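
A hypothetical staged ordering using -target (the resource addresses come from the configuration below; adjust to match your own state):

# step 1: create the roles and policies so their ARNs exist
terraform apply -target=aws_iam_role.rotate_secret_lambda_role -target=aws_iam_policy.rotating_secret_lambda_policy -target=aws_iam_policy.retrieve_rds_secret_policy
# step 2: attach the policies now that the ARNs are known
terraform apply -target=aws_iam_role_policy_attachment.rotating_lambda_role_policy_attachment
# step 3: create everything else
terraform apply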

main.tf file contents:

resource "aws_default_vpc" "default" {
  tags = {
    Name = "Default VPC"
  }
}

resource "aws_security_group" "rds_cluster" {
  vpc_id = aws_default_vpc.default.id
}

resource "aws_security_group_rule" "internal_vpc_ingress" {
  type = "ingress"
  from_port = 0
  to_port = 0
  protocol = "-1"
  cidr_blocks = [aws_default_vpc.default.cidr_block]
  security_group_id = aws_security_group.rds_cluster.id
}

resource "aws_security_group_rule" "public_egress" {
  type = "egress"
  from_port = 0
  to_port = 0
  protocol = "-1"
  cidr_blocks = ["0.0.0.0/0"]
  security_group_id = aws_security_group.rds_cluster.id
}

resource "aws_default_subnet" "default_az1" {
  availability_zone = "us-east-2a"

  tags = {
    Name = "Default subnet for us-east-2a"
  }
}

resource "aws_default_subnet" "default_az2" {
  availability_zone = "us-east-2b"

  tags = {
    Name = "Default subnet for us-east-2b"
  }
}

resource "aws_default_subnet" "default_az3" {
  availability_zone = "us-east-2c"

  tags = {
    Name = "Default subnet for us-east-2c"
  }
}

resource "aws_db_subnet_group" "rds_cluster" {
  subnet_ids = [
    aws_default_subnet.default_az1.id, 
    aws_default_subnet.default_az2.id, 
    aws_default_subnet.default_az3.id
  ]
}

 resource "random_password" "password" {
  length               = 16
  min_lower        = 1
  min_numeric    = 1
  min_upper        = 1
}

resource "aws_secretsmanager_secret" "rds_cluster_pw" {
  name = "initial_password_for_rds_cluster"
}

resource "aws_secretsmanager_secret_rotation" "rds_cluster_pw" {
  secret_id                    = aws_secretsmanager_secret.rds_cluster_pw.id
  rotation_lambda_arn = aws_lambda_function.rotate_secret_lambda.arn

  rotation_rules {
    automatically_after_days = 1
  }
}

resource "aws_secretsmanager_secret_version" "rd_cluster_pw_value" {
  secret_id          = aws_secretsmanager_secret.rds_cluster_pw.id
  secret_string   = jsonencode({
       username   = aws_rds_cluster.postgresql.master_username
       password   = aws_rds_cluster.postgresql.master_password
       dbname      = var.db_name
       engine        = "postgres"
       host            = aws_rds_cluster.postgresql.endpoint
  })
}

resource "aws_lambda_permission" "allow_rotate_secrets_permission" {
  statement_id      = "AllowExecutionFromSecretsManager"
  action                  = "lambda:InvokeFunction"
  function_name    = aws_lambda_function.rotate_secret_lambda.function_name
  principal              = "secretsmanager.amazonaws.com"
  source_arn         = aws_secretsmanager_secret.rds_cluster_pw.arn
}

resource "aws_rds_cluster" "postgresql" {
  cluster_identifier                = "${var.db_name}-rds-db-cluster"
  engine                                 = "aurora-postgresql"
  availability_zones                = ["us-east-2a", "us-east-2b", "us-east-2c"]
  database_name                   = var.db_name
  master_username                = var.db_username
  master_password                = random_password.password.result
  backup_retention_period    = 5   
  vpc_security_group_ids      = [aws_security_group.rds_cluster.id]
  db_subnet_group_name     = aws_db_subnet_group.rds_cluster.name
  # preferred_backup_window  = "07:00-09:00"
  skip_final_snapshot            = true
  preferred_maintenance_window   = "wed:02:00-wed:02:30"
}

resource "aws_rds_cluster_instance" "cluster_instances" {
  count                           = 2
  identifier                      = "poc-aurora-cluster-instance-${count.index}"
  cluster_identifier         = aws_rds_cluster.postgresql.id
  instance_class            = var.db_instance_type
  engine                         = aws_rds_cluster.postgresql.engine
  engine_version           = aws_rds_cluster.postgresql.engine_version
  preferred_maintenance_window    = "wed:02:00-wed:02:30"
}

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id                              = aws_default_vpc.default.id
  service_name                  = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type         = "Interface"
  private_dns_enabled      = true
  subnet_ids                      = [aws_default_subnet.default_az1.id]
  security_group_ids         = [aws_security_group.rds_cluster.id]
}

resource "aws_lambda_function" "rotate_secret_lambda" {
  filename                    = "${path.module}/lambda/zip/rotating_lambda_function.zip"
  function_name          = "rotate_rds_cluster_secret_lambda"
  role                            = aws_iam_role.rotate_secret_lambda_role.arn

  runtime                      = "python3.9"
  handler                      = "rotating_lambda.lambda_handler"
  timeout                      = 900

  depends_on            = [aws_iam_role.rotate_secret_lambda_role]

  vpc_config {
    security_group_ids = [aws_security_group.rds_cluster.id]
    subnet_ids         = [aws_default_subnet.default_az1.id]
  }

  environment {
    variables = {
      SECRETS_MANAGER_ENDPOINT = "https://secretsmanager.${var.region}.amazonaws.com"
    }
  }
}

data "aws_iam_policy_document" "lambda_service_role_policy" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }

    actions = ["sts:AssumeRole"]
  }
}

resource "aws_iam_role" "rotate_secret_lambda_role" {
  name = "rotating_lambda_role"

  assume_role_policy = data.aws_iam_policy_document.lambda_service_role_policy.json
}

data "aws_iam_policy" "lambda_vpc_access_policy" {
  arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_iam_policy" "rotating_secret_lambda_policy" {
  name        = "lambda_secrets_policy"
  policy = jsonencode({
    "Version": "2012-10-17",
    "Statement": [
      {
          "Effect": "Allow",
          "Action": [
              "secretsmanager:DescribeSecret",
              "secretsmanager:GetSecretValue",
              "secretsmanager:PutSecretValue",
              "secretsmanager:UpdateSecretVersionStage"
          ],
          "Resource": aws_secretsmanager_secret.rds_cluster_pw.arn
      },
      {
          "Effect": "Allow",
          "Action": [
              "secretsmanager:GetRandomPassword"
          ],
          "Resource": "*"
      },
      {
          "Action": [
              "ec2:CreateNetworkInterface",
              "ec2:DeleteNetworkInterface",
              "ec2:DescribeNetworkInterfaces",
              "ec2:DetachNetworkInterface"
          ],
          "Resource": "*",
          "Effect": "Allow"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "rotating_lambda_role_policy_attachment" {
  for_each = toset([
    data.aws_iam_policy.lambda_vpc_access_policy.arn,
    aws_iam_policy.rotating_secret_lambda_policy.arn
  ])

  role            = aws_iam_role.rotate_secret_lambda_role.name
  policy_arn = each.value
}

data "aws_iam_policy_document" "rds_service_role" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["rds.amazonaws.com"]
    }

    actions = ["sts:AssumeRole"]
  }
}

resource "aws_iam_policy" "retrieve_rds_secret_policy" {
  name        = "retrieve_rds_secret_policy"
  policy = jsonencode({
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "secretsmanager:GetSecretValue"
        ],
        "Resource": aws_secretsmanager_secret.rds_cluster_pw.arn
      },
    ]
  })
}

resource "aws_iam_role" "rds_proxy_role" {
  name               = "retrieve_rds_secret_policy"
  assume_role_policy = data.aws_iam_policy_document.rds_service_role.json
}

resource "aws_iam_role_policy_attachment" "rds_proxy_secrets_manager_permission_attachment" {
  role       = aws_iam_role.rds_proxy_role.name
  policy_arn = aws_iam_policy.retrieve_rds_secret_policy.arn
}

resource "aws_db_proxy" "yeti_proxy" {
  name                             = "yeti-proxy"
  debug_logging             = false
  engine_family               = "POSTGRESQL"
  idle_client_timeout       = 1800
  role_arn                         = aws_iam_role.rds_proxy_role.arn
  vpc_security_group_ids  = [aws_security_group.rds_cluster.id]
  vpc_subnet_ids             = [aws_default_subnet.default_az1.id, 
                                              aws_default_subnet.default_az2.id,
                                              aws_default_subnet.default_az3.id]

  auth {
    auth_scheme   = "SECRETS"
    iam_auth      = "DISABLED"
    secret_arn    = aws_secretsmanager_secret.rds_cluster_pw.arn
  }
}

resource "aws_db_proxy_default_target_group" "yeti_cluster_proxy" {
  db_proxy_name = aws_db_proxy.yeti_proxy.name

  connection_pool_config {
    connection_borrow_timeout    = 120
    init_query                   = "SET x=1, y=2"
    max_connections_percent      = 100
    max_idle_connections_percent = 50
  }
}

resource "aws_db_proxy_target" "example" {
  db_cluster_identifier  = aws_rds_cluster.postgresql.id
  db_proxy_name          = aws_db_proxy.yeti_proxy.name
  target_group_name      = aws_db_proxy_default_target_group.yeti_cluster_proxy.name
}

resource "aws_lambda_function" "create_table_lambda" {
  filename              = "${path.module}/lambda/zip/create_table_lambda_function.zip"
  function_name   = "create_table_lambda"
  role                     = aws_iam_role.modify_rds_tables_lambda_role.arn

  runtime               = "python3.9"
  handler               = "create_table_lambda.lambda_handler"
  timeout               = 900

  depends_on       = [aws_rds_cluster.postgresql,
                                 aws_rds_cluster_instance.cluster_instances,
                                 aws_db_proxy.yeti_proxy]

  vpc_config {
    security_group_ids = [aws_security_group.rds_cluster.id]
    subnet_ids         = [aws_default_subnet.default_az1.id, 
                                   aws_default_subnet.default_az2.id,
                                   aws_default_subnet.default_az3.id]
  }

  environment {
    variables = {
      SECRET_NAME        = aws_secretsmanager_secret_rotation.rds_cluster_pw.id,
      RDS_PROXY_ENDPOINT = aws_db_proxy.yeti_proxy.endpoint,
      REGION             = var.region
    }
  }
}

data "aws_iam_policy_document" "rds_proxy_connection_permission" {
  statement {
    effect = "Allow"
    actions = ["rds-db:connect"]
    resources = ["arn:aws:rds-db:${var.region}:${local.account_id}:dbuser:{aws_rds_cluster.postgresql.cluster_resource_id}/*"]
  }
}

resource "aws_iam_role" "modify_rds_tables_lambda_role" {
  name = "modify_rds_tables_lambda_role"

  assume_role_policy = data.aws_iam_policy_document.lambda_service_role_policy.json
}

resource "aws_iam_policy" "rds_proxy_connection_policy" {
  name   = "rds_connection_policy"
  policy = data.aws_iam_policy_document.rds_proxy_connection_permission.json
}

resource "aws_iam_role_policy_attachment" "lambda_cloudwatch_rds_proxy_permission" {
  for_each = toset([
    data.aws_iam_policy.lambda_vpc_access_policy.arn,
    aws_iam_policy.rds_proxy_connection_policy.arn,
    aws_iam_policy.retrieve_rds_secret_policy.arn
  ])
  role       = aws_iam_role.modify_rds_tables_lambda_role.name
  policy_arn = each.value
}

resource "aws_lambda_invocation" "create_table" {
  function_name = aws_lambda_function.create_table_lambda.function_name

  input = jsonencode({})   
}

(2) Within the CloudWatch management console, navigate to the logs for the create-table Lambda function and for the proxy that was created to see the errors.
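
Equivalently, the logs can be tailed with the AWS CLI v2; this assumes the default log group naming conventions (/aws/lambda/<function-name> for Lambda and /aws/rds/proxy/<proxy-name> for the proxy), so adjust if yours differ:

aws logs tail /aws/lambda/create_table_lambda --follow
aws logs tail /aws/rds/proxy/yeti-proxy --follow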

Debug Output

No response

Panic Output

No response

Important Factoids

Here are the things I have checked/done to try to debug and prevent this error from happening:

References

I referenced the following AWS docs while trying to debug this:

Would you like to implement a fix?

None

github-actions[bot] commented 10 months ago

Community Note

Voting for Prioritization

Volunteering to Work on This Issue

trevorrea commented 10 months ago

Hi,

I haven't read the whole issue in great detail, but a couple of things stood out. You said: "Confirm that the RDS proxy and lambda functions have appropriate permissions through IAM that enable them to do what they need to do to communicate properly; I believe they do. For example, the lambda function that's trying to connect to the proxy does have the rds-db:connect permission allowed for it to connect to the RDS cluster."

Is this relevant? Since you're not using IAM auth to connect to the RDS cluster, your Lambda doesn't need that permission anyway.

On your DB proxy, can you try:

resource "aws_db_proxy" "yeti_proxy" {
  name                             = "yeti-proxy"
  debug_logging             = false
  engine_family               = "POSTGRESQL"
  idle_client_timeout       = 1800
  role_arn                         = aws_iam_role.rds_proxy_role.arn
  vpc_security_group_ids  = [aws_security_group.rds_cluster.id]
  vpc_subnet_ids             = [aws_default_subnet.default_az1.id, 
                                              aws_default_subnet.default_az2.id,
                                              aws_default_subnet.default_az3.id]
  auth {
    auth_scheme   = "SECRETS"
    client_password_auth_type = "POSTGRES_MD5"
    iam_auth      = "DISABLED"
    secret_arn    = aws_secretsmanager_secret.rds_cluster_pw.arn
  }
}

Note: I have added client_password_auth_type = "POSTGRES_MD5" to the auth block. For Postgres it seems to default to POSTGRES_SCRAM_SHA_256, which has in the past shown behaviour similar to what you've described here.

Try that and see if it makes any difference, and/or compare your console-created proxy with the Terraform-created one.

westrachel commented 10 months ago

Thank you for the suggestion! You're right, rds-db:connect doesn't matter here. I had mixed up in my notes that it was required regardless of the authentication mode, but it's only needed for IAM authentication, which I'm not currently using.

I added the attribute assignment you suggested, client_password_auth_type = "POSTGRES_MD5", but unfortunately it didn't make a difference. I re-invoked the Lambda function and it still produced the following error message in CloudWatch:

Unknown error. SSL connection has been closed unexpectedly

I also have enhanced logging enabled for the proxy, but this auth change combined with enhanced logging didn't produce a more informative error in the CloudWatch logs; the proxy logs just show:

Proxy authentication with PostgreSQL native password authentication succeeded for user <var.db_username> with TLS on.
A TCP connection was established from the proxy at <IP>:<PORT> to the database at <IP>:5432.
The new database connection successfully authenticated with TLS on.
The database connection closed. Reason: An internal error occurred.

For additional reference, the proxy I temporarily created in the console, which facilitated connections without error, was using SCRAM SHA-256 for the client authentication type instead of PostgreSQL MD5 (which is what client_password_auth_type = "POSTGRES_MD5" configures).

trevorrea commented 10 months ago

Weird... Have you opened an AWS support ticket to ask them if they can see what's going on?

Maybe try client_password_auth_type = "POSTGRES_SCRAM_SHA_256", but I think that's the default anyway?

Also, are you using the exact same secret and IAM roles/policies with the manually created proxy? I think your aws_iam_policy resource retrieve_rds_secret_policy is missing KMS permissions per the docs at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy-setup.html#rds-proxy-iam-setup, so possibly the proxy can't actually read the secret? But that doesn't make sense, as the logs seem to suggest it can. See the sketch below for what the extra statement would look like.
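
Per those docs, the missing piece would look something like this (untested sketch; the data source lookup and policy name are made up, and this assumes the secret is encrypted with the default aws/secretsmanager key):

data "aws_kms_key" "secretsmanager_default" {
  key_id = "alias/aws/secretsmanager"
}

resource "aws_iam_policy" "rds_proxy_kms_policy" {
  name   = "rds_proxy_kms_policy"
  policy = jsonencode({
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["kms:Decrypt"],
        "Resource": data.aws_kms_key.secretsmanager_default.arn,
        "Condition": {
          "StringEquals": {
            # only allow decryption when it happens via Secrets Manager
            "kms:ViaService": "secretsmanager.${var.region}.amazonaws.com"
          }
        }
      }
    ]
  })
}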

In summary - weird problem. I don't know what's wrong. Sorry!

westrachel commented 10 months ago

I wasn't aware of the AWS support ticket option. I will give that a shot! Thank you for the idea!

You're right, SCRAM SHA-256 is the default auth type. Both my Terraform proxy and the console proxy were initially using that authentication mode before your suggestion to toggle that configuration.

I am assigning the two proxies the same role that I'm creating through Terraform, and I've compared the console configuration details across both: they have the same values for all the settings. There was an example AWS document suggesting that KMS permissions are necessary only if you're using a custom KMS key, which I'm currently not. Since the console proxy works with the Terraform-configured role, which doesn't have the underlying KMS permission attached through a policy, I don't think adding it should make a difference. The Terraform proxy logs I've included above explicitly say proxy authentication with PostgreSQL native password authentication succeeded for my db user, suggesting it can read the secret fine; if it couldn't, I'd expect the logs to show an error message like the one in this forum about not being able to retrieve a secret.

westrachel commented 10 months ago

Okay, I never got a response to the AWS support ticket I opened. However, I realized that I could enable logs for the db instances in addition to all the other (enhanced) logs I already had enabled for other components. Looking at the db instance logs, I can see errors related to the init_query's "SET x=1, y=2". I had that in the configuration of the aws_db_proxy_default_target_group resource because I never adjusted it from the example in the AWS Terraform docs. I technically don't need it, and that attribute is optional. After removing it, the Lambda function is able to connect to the RDS db proxy created through Terraform and execute SQL statements successfully. The working target group is shown below.
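
For anyone who hits the same thing, the working configuration is just the original target group with init_query removed:

resource "aws_db_proxy_default_target_group" "yeti_cluster_proxy" {
  db_proxy_name = aws_db_proxy.yeti_proxy.name

  connection_pool_config {
    connection_borrow_timeout    = 120
    max_connections_percent      = 100
    max_idle_connections_percent = 50
  }
}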

github-actions[bot] commented 9 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.