hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: RDS Proxy provisioned through terraform drops connections due to unknown error #34747

Closed: westrachel closed this issue 10 months ago

westrachel commented 10 months ago

Terraform Core Version

1.6.4

AWS Provider Version

3.6.0, 5.25.0

Affected Resource(s)

aws_db_proxy, aws_db_proxy_default_target_group, aws_db_proxy_target

Expected Behavior

When I invoke a Lambda function that connects to an RDS Aurora PostgreSQL cluster provisioned through Terraform, I expect the function to maintain its connection to the database without errors so that it can execute SQL statements successfully.

Actual Behavior

When I invoke a Lambda function that connects to an RDS Aurora PostgreSQL cluster provisioned through Terraform, the database connection is dropped when the function tries to execute SQL statements, preventing it from doing meaningful work. The error is vague, but based on the steps I have taken to debug this, I believe the problem lies with the RDS proxy.

Relevant Error/Panic Output Snippet

The CloudWatch logs of the Lambda function that's using the proxy to connect to the database show the following error:

Unknown error. SSL connection has been closed unexpectedly

The proxy's CloudWatch logs show the following messages:

A TCP connection was established from the proxy at <IP>:<PORT> to the database at <IP>:5432.
The database connection closed. Reason: An internal error occurred.

Terraform Configuration Files

provider.tf file contents:
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

variables.tf file contents:
locals {
  dot_env_file_path = ".env"
  dot_env_regex     = "(?m:^\\s*([^#\\s]\\S*)\\s*=\\s*[\"']?(.*[^\"'\\s])[\"']?\\s*$)"
  dot_env           = { for tuple in regexall(local.dot_env_regex, file(local.dot_env_file_path)) : tuple[0] => sensitive(tuple[1]) }
  account_id        = local.dot_env["ACCOUNT_ID"]
  provider_name     = local.dot_env["PROVIDER_AWS_ROLE"]
  aws_key_id        = local.dot_env["AWS_KEY_ID"]
  aws_key_value     = local.dot_env["AWS_KEY_VALUE"]
  db_port           = local.dot_env["PORT"]
}

variable "region" {
  default = "string value of AWS region, replace this with an AWS region that's relevant to you"
}

variable "db_name" {
  default = "string value of the database name, replace this with whatever you want"
}

variable "db_username" {
  default = "string value of db username - replace this value with whatever you want"
}

variable "db_instance_type" {
  default = "replace with the string value of the AWS db instance type you want to use"
}

The .env file needs to contain the following variables. Please replace the values with ones appropriate for your AWS account; the port is the PostgreSQL port, 5432. Also note that the role I temporarily created for the Terraform provider to use is overly permissive: it has an underlying policy that allows all actions across all resources.

ACCOUNT_ID=
PROVIDER_AWS_ROLE=
AWS_KEY_ID=
AWS_KEY_VALUE=
PORT=
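
(For completeness: these locals would presumably feed an aws provider block roughly like the following. I'm not showing my real provider block, so treat this shape, in particular the assume_role wiring, as an illustration only.)

provider "aws" {
  region     = var.region
  access_key = local.aws_key_id    # from AWS_KEY_ID in .env
  secret_key = local.aws_key_value # from AWS_KEY_VALUE in .env

  assume_role {
    # illustration only: assumes PROVIDER_AWS_ROLE holds the role name
    role_arn = "arn:aws:iam::${local.account_id}:role/${local.provider_name}"
  }
}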

Steps to Reproduce

(1) terraform plan and terraform apply all of the following resources in steps. Note that the code below references variables (var and local) whose values need to be configured; I show dummy versions of the files that hold these values under the Terraform Configuration Files section. I say to apply the changes in steps because they cannot all be applied at once: for example, the aws_iam_role_policy_attachment resources cannot be applied before the underlying policies they attach are created, because their ARNs aren't available yet (see the sketch of a staged apply below). I'm not currently showing the code for the rotating-secret Lambda or the create-table Lambda, but let me know if that's needed and I can provide a sample. The create-table Lambda is the one that connects to the RDS proxy to execute SQL statements.
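
A hypothetical staged ordering using -target (the resource addresses come from the configuration below; adjust to match your own state):

# step 1: create the roles and policies so their ARNs exist
terraform apply -target=aws_iam_role.rotate_secret_lambda_role -target=aws_iam_policy.rotating_secret_lambda_policy -target=aws_iam_policy.retrieve_rds_secret_policy
# step 2: attach the policies now that the ARNs are known
terraform apply -target=aws_iam_role_policy_attachment.rotating_lambda_role_policy_attachment
# step 3: create everything else
terraform apply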

main.tf file contents:

resource "aws_default_vpc" "default" {
  tags = {
    Name = "Default VPC"
  }
}

resource "aws_security_group" "rds_cluster" {
  vpc_id = aws_default_vpc.default.id
}

resource "aws_security_group_rule" "internal_vpc_ingress" {
  type = "ingress"
  from_port = 0
  to_port = 0
  protocol = "-1"
  cidr_blocks = [aws_default_vpc.default.cidr_block]
  security_group_id = aws_security_group.rds_cluster.id
}

resource "aws_security_group_rule" "public_egress" {
  type = "egress"
  from_port = 0
  to_port = 0
  protocol = "-1"
  cidr_blocks = ["0.0.0.0/0"]
  security_group_id = aws_security_group.rds_cluster.id
}

resource "aws_default_subnet" "default_az1" {
  availability_zone = "us-east-2a"

  tags = {
    Name = "Default subnet for us-east-2a"
  }
}

resource "aws_default_subnet" "default_az2" {
  availability_zone = "us-east-2b"

  tags = {
    Name = "Default subnet for us-east-2b"
  }
}

resource "aws_default_subnet" "default_az3" {
  availability_zone = "us-east-2c"

  tags = {
    Name = "Default subnet for us-east-2c"
  }
}

resource "aws_db_subnet_group" "rds_cluster" {
  subnet_ids = [
    aws_default_subnet.default_az1.id, 
    aws_default_subnet.default_az2.id, 
    aws_default_subnet.default_az3.id
  ]
}

 resource "random_password" "password" {
  length               = 16
  min_lower        = 1
  min_numeric    = 1
  min_upper        = 1
}

resource "aws_secretsmanager_secret" "rds_cluster_pw" {
  name = "initial_password_for_rds_cluster"
}

resource "aws_secretsmanager_secret_rotation" "rds_cluster_pw" {
  secret_id                    = aws_secretsmanager_secret.rds_cluster_pw.id
  rotation_lambda_arn = aws_lambda_function.rotate_secret_lambda.arn

  rotation_rules {
    automatically_after_days = 1
  }
}

resource "aws_secretsmanager_secret_version" "rd_cluster_pw_value" {
  secret_id          = aws_secretsmanager_secret.rds_cluster_pw.id
  secret_string   = jsonencode({
       username   = aws_rds_cluster.postgresql.master_username
       password   = aws_rds_cluster.postgresql.master_password
       dbname      = var.db_name
       engine        = "postgres"
       host            = aws_rds_cluster.postgresql.endpoint
  })
}

resource "aws_lambda_permission" "allow_rotate_secrets_permission" {
  statement_id      = "AllowExecutionFromSecretsManager"
  action                  = "lambda:InvokeFunction"
  function_name    = aws_lambda_function.rotate_secret_lambda.function_name
  principal              = "secretsmanager.amazonaws.com"
  source_arn         = aws_secretsmanager_secret.rds_cluster_pw.arn
}

resource "aws_rds_cluster" "postgresql" {
  cluster_identifier                = "${var.db_name}-rds-db-cluster"
  engine                                 = "aurora-postgresql"
  availability_zones                = ["us-east-2a", "us-east-2b", "us-east-2c"]
  database_name                   = var.db_name
  master_username                = var.db_username
  master_password                = random_password.password.result
  backup_retention_period    = 5   
  vpc_security_group_ids      = [aws_security_group.rds_cluster.id]
  db_subnet_group_name     = aws_db_subnet_group.rds_cluster.name
  # preferred_backup_window  = "07:00-09:00"
  skip_final_snapshot            = true
  preferred_maintenance_window   = "wed:02:00-wed:02:30"
}

resource "aws_rds_cluster_instance" "cluster_instances" {
  count                           = 2
  identifier                      = "poc-aurora-cluster-instance-${count.index}"
  cluster_identifier         = aws_rds_cluster.postgresql.id
  instance_class            = var.db_instance_type
  engine                         = aws_rds_cluster.postgresql.engine
  engine_version           = aws_rds_cluster.postgresql.engine_version
  preferred_maintenance_window    = "wed:02:00-wed:02:30"
}

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id                              = aws_default_vpc.default.id
  service_name                  = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type         = "Interface"
  private_dns_enabled      = true
  subnet_ids                      = [aws_default_subnet.default_az1.id]
  security_group_ids         = [aws_security_group.rds_cluster.id]
}

resource "aws_lambda_function" "rotate_secret_lambda" {
  filename                    = "${path.module}/lambda/zip/rotating_lambda_function.zip"
  function_name          = "rotate_rds_cluster_secret_lambda"
  role                            = aws_iam_role.rotate_secret_lambda_role.arn

  runtime                      = "python3.9"
  handler                      = "rotating_lambda.lambda_handler"
  timeout                      = 900

  depends_on            = [aws_iam_role.rotate_secret_lambda_role]

  vpc_config {
    security_group_ids = [aws_security_group.rds_cluster.id]
    subnet_ids         = [aws_default_subnet.default_az1.id]
  }

  environment {
    variables = {
      SECRETS_MANAGER_ENDPOINT = "https://secretsmanager.${var.region}.amazonaws.com"
    }
  }
}

data "aws_iam_policy_document" "lambda_service_role_policy" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }

    actions = ["sts:AssumeRole"]
  }
}

resource "aws_iam_role" "rotate_secret_lambda_role" {
  name = "rotating_lambda_role"

  assume_role_policy = data.aws_iam_policy_document.lambda_service_role_policy.json
}

data "aws_iam_policy" "lambda_vpc_access_policy" {
  arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_iam_policy" "rotating_secret_lambda_policy" {
  name        = "lambda_secrets_policy"
  policy = jsonencode({
    "Version": "2012-10-17",
    "Statement": [
      {
          "Effect": "Allow",
          "Action": [
              "secretsmanager:DescribeSecret",
              "secretsmanager:GetSecretValue",
              "secretsmanager:PutSecretValue",
              "secretsmanager:UpdateSecretVersionStage"
          ],
          "Resource": aws_secretsmanager_secret.rds_cluster_pw.arn
      },
      {
          "Effect": "Allow",
          "Action": [
              "secretsmanager:GetRandomPassword"
          ],
          "Resource": "*"
      },
      {
          "Action": [
              "ec2:CreateNetworkInterface",
              "ec2:DeleteNetworkInterface",
              "ec2:DescribeNetworkInterfaces",
              "ec2:DetachNetworkInterface"
          ],
          "Resource": "*",
          "Effect": "Allow"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "rotating_lambda_role_policy_attachment" {
  for_each = toset([
    data.aws_iam_policy.lambda_vpc_access_policy.arn,
    aws_iam_policy.rotating_secret_lambda_policy.arn
  ])

  role            = aws_iam_role.rotate_secret_lambda_role.name
  policy_arn = each.value
}

data "aws_iam_policy_document" "rds_service_role" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["rds.amazonaws.com"]
    }

    actions = ["sts:AssumeRole"]
  }
}

resource "aws_iam_policy" "retrieve_rds_secret_policy" {
  name        = "retrieve_rds_secret_policy"
  policy = jsonencode({
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "secretsmanager:GetSecretValue"
        ],
        "Resource": aws_secretsmanager_secret.rds_cluster_pw.arn
      },
    ]
  })
}

resource "aws_iam_role" "rds_proxy_role" {
  name               = "retrieve_rds_secret_policy"
  assume_role_policy = data.aws_iam_policy_document.rds_service_role.json
}

resource "aws_iam_role_policy_attachment" "rds_proxy_secrets_manager_permission_attachment" {
  role       = aws_iam_role.rds_proxy_role.name
  policy_arn = aws_iam_policy.retrieve_rds_secret_policy.arn
}

resource "aws_db_proxy" "yeti_proxy" {
  name                             = "yeti-proxy"
  debug_logging             = false
  engine_family               = "POSTGRESQL"
  idle_client_timeout       = 1800
  role_arn                         = aws_iam_role.rds_proxy_role.arn
  vpc_security_group_ids  = [aws_security_group.rds_cluster.id]
  vpc_subnet_ids             = [aws_default_subnet.default_az1.id, 
                                              aws_default_subnet.default_az2.id,
                                              aws_default_subnet.default_az3.id]

  auth {
    auth_scheme   = "SECRETS"
    iam_auth      = "DISABLED"
    secret_arn    = aws_secretsmanager_secret.rds_cluster_pw.arn
  }
}

resource "aws_db_proxy_default_target_group" "yeti_cluster_proxy" {
  db_proxy_name = aws_db_proxy.yeti_proxy.name

  connection_pool_config {
    connection_borrow_timeout    = 120
    init_query                   = "SET x=1, y=2"
    max_connections_percent      = 100
    max_idle_connections_percent = 50
  }
}

resource "aws_db_proxy_target" "example" {
  db_cluster_identifier  = aws_rds_cluster.postgresql.id
  db_proxy_name          = aws_db_proxy.yeti_proxy.name
  target_group_name      = aws_db_proxy_default_target_group.yeti_cluster_proxy.name
}

resource "aws_lambda_function" "create_table_lambda" {
  filename              = "${path.module}/lambda/zip/create_table_lambda_function.zip"
  function_name   = "create_table_lambda"
  role                     = aws_iam_role.modify_rds_tables_lambda_role.arn

  runtime               = "python3.9"
  handler               = "create_table_lambda.lambda_handler"
  timeout               = 900

  depends_on       = [aws_rds_cluster.postgresql,
                                 aws_rds_cluster_instance.cluster_instances,
                                 aws_db_proxy.yeti_proxy]

  vpc_config {
    security_group_ids = [aws_security_group.rds_cluster.id]
    subnet_ids         = [aws_default_subnet.default_az1.id, 
                                   aws_default_subnet.default_az2.id,
                                   aws_default_subnet.default_az3.id]
  }

  environment {
    variables = {
      SECRET_NAME        = aws_secretsmanager_secret_rotation.rds_cluster_pw.id,
      RDS_PROXY_ENDPOINT = aws_db_proxy.yeti_proxy.endpoint,
      REGION             = var.region
    }
  }
}

data "aws_iam_policy_document" "rds_proxy_connection_permission" {
  statement {
    effect = "Allow"
    actions = ["rds-db:connect"]
    resources = ["arn:aws:rds-db:${var.region}:${local.account_id}:dbuser:{aws_rds_cluster.postgresql.cluster_resource_id}/*"]
  }
}

resource "aws_iam_role" "modify_rds_tables_lambda_role" {
  name = "modify_rds_tables_lambda_role"

  assume_role_policy = data.aws_iam_policy_document.lambda_service_role_policy.json
}

resource "aws_iam_policy" "rds_proxy_connection_policy" {
  name   = "rds_connection_policy"
  policy = data.aws_iam_policy_document.rds_proxy_connection_permission.json
}

resource "aws_iam_role_policy_attachment" "lambda_cloudwatch_rds_proxy_permission" {
  for_each = toset([
    data.aws_iam_policy.lambda_vpc_access_policy.arn,
    aws_iam_policy.rds_proxy_connection_policy.arn,
    aws_iam_policy.retrieve_rds_secret_policy.arn
  ])
  role       = aws_iam_role.modify_rds_tables_lambda_role.name
  policy_arn = each.value
}

resource "aws_lambda_invocation" "create_table" {
  function_name = aws_lambda_function.create_table_lambda.function_name

  input = jsonencode({})   
}

(2) Within the CloudWatch management console, navigate to the logs for the create-table Lambda function and for the proxy that was created to see the errors.
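
Equivalently, the logs can be tailed with the AWS CLI v2; this assumes the default log group naming conventions (/aws/lambda/<function-name> for Lambda and /aws/rds/proxy/<proxy-name> for the proxy), so adjust if yours differ:

aws logs tail /aws/lambda/create_table_lambda --follow
aws logs tail /aws/rds/proxy/yeti-proxy --follow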

Debug Output

No response

Panic Output

No response

Important Factoids

Here are the things I have checked/done to try to debug and prevent this error from happening:

References

I referenced the following AWS docs while trying to debug this:

Would you like to implement a fix?

None

github-actions[bot] commented 10 months ago

Community Note

Voting for Prioritization

Volunteering to Work on This Issue

trevorrea commented 10 months ago

Hi,

I haven't read the whole issue in great detail, but a couple of things stood out. You said: "Confirm that the RDS proxy and lambda functions have appropriate permissions through IAM that enable them to do what they need to do to communicate properly; I believe they do. For example, the lambda function that's trying to connect to the proxy does have the rds-db:connect permission allowed for it to connect to the RDS cluster."

Is this relevant? Since you're not using IAM auth to connect to the RDS cluster, your Lambda doesn't need that permission anyway.

On your DB proxy, can you try:

resource "aws_db_proxy" "yeti_proxy" {
  name                             = "yeti-proxy"
  debug_logging             = false
  engine_family               = "POSTGRESQL"
  idle_client_timeout       = 1800
  role_arn                         = aws_iam_role.rds_proxy_role.arn
  vpc_security_group_ids  = [aws_security_group.rds_cluster.id]
  vpc_subnet_ids             = [aws_default_subnet.default_az1.id, 
                                              aws_default_subnet.default_az2.id,
                                              aws_default_subnet.default_az3.id]
  auth {
    auth_scheme   = "SECRETS"
    client_password_auth_type = "POSTGRES_MD5"
    iam_auth      = "DISABLED"
    secret_arn    = aws_secretsmanager_secret.rds_cluster_pw.arn
  }
}

Note: I have added client_password_auth_type = "POSTGRES_MD5" to the auth block. For Postgres it seems to default to POSTGRES_SCRAM_SHA_256, which has in the past shown behaviour similar to what you've described here.

Try that and see if it makes any difference, and/or compare your console-created proxy with the Terraform-created one.

westrachel commented 10 months ago

Thank you for the suggestion! You're right, rds-db:connect doesn't matter here. I had mixed up in my notes that it was required regardless of the authentication mode, but it's only needed for IAM authentication, which I'm not currently using.

I added the attribute assignment you suggested, client_password_auth_type = "POSTGRES_MD5", but unfortunately it didn't make a difference. I re-invoked the Lambda function and it still produced the following error message in CloudWatch:

Unknown error. SSL connection has been closed unexpectedly

I also have enhanced logging enabled for the proxy, but this auth change combined with enhanced logging didn't produce a more informative error in the CloudWatch logs; the proxy logs just show:

Proxy authentication with PostgreSQL native password authentication succeeded for user <var.db_username> with TLS on.
A TCP connection was established from the proxy at <IP>:<PORT> to the database at <IP>:5432.
The new database connection successfully authenticated with TLS on.
The database connection closed. Reason: An internal error occurred.

For additional reference, the proxy I temporarily created in the console, which facilitated connections without error, was using SCRAM SHA-256 for the client authentication type instead of PostgreSQL MD5 (which is what client_password_auth_type = "POSTGRES_MD5" configures).

trevorrea commented 10 months ago

Weird... Have you opened an AWS support ticket to ask them if they can see what's going on?

Maybe try client_password_auth_type = "POSTGRES_SCRAM_SHA_256", but I think that's the default anyway?

Also, are you using the exact same secret and IAM roles/policies with the manually created proxy? I think your aws_iam_policy resource retrieve_rds_secret_policy is missing KMS permissions per the docs at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy-setup.html#rds-proxy-iam-setup, so possibly the proxy can't actually read the secret? But that doesn't make sense, as the logs seem to suggest it can. See the sketch below for what the extra statement would look like.
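
Per those docs, the missing piece would look something like this (untested sketch; the data source lookup and policy name are made up, and this assumes the secret is encrypted with the default aws/secretsmanager key):

data "aws_kms_key" "secretsmanager_default" {
  key_id = "alias/aws/secretsmanager"
}

resource "aws_iam_policy" "rds_proxy_kms_policy" {
  name   = "rds_proxy_kms_policy"
  policy = jsonencode({
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["kms:Decrypt"],
        "Resource": data.aws_kms_key.secretsmanager_default.arn,
        "Condition": {
          "StringEquals": {
            # only allow decryption when it happens via Secrets Manager
            "kms:ViaService": "secretsmanager.${var.region}.amazonaws.com"
          }
        }
      }
    ]
  })
}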

In summary - weird problem. I don't know what's wrong. Sorry!

westrachel commented 10 months ago

I wasn't aware of the AWS support ticket option. I will give that a shot! Thank you for the idea!

You're right, SCRAM SHA-256 is the default auth type. Both my Terraform proxy and the console proxy were initially using that authentication mode before your suggestion to toggle that configuration.

I am assigning the two proxies the same role that I'm creating through Terraform, and I've compared the console configuration details across both: they have the same values for all the settings. There was an example AWS document suggesting that KMS permissions are necessary only if you're using a custom KMS key, which I'm currently not. Since the console proxy works with the Terraform-configured role, which doesn't have the underlying KMS permission attached through a policy, I don't think adding it should make a difference. The Terraform proxy logs I've included above explicitly say proxy authentication with PostgreSQL native password authentication succeeded for my db user, suggesting it can read the secret fine; if it couldn't, I'd expect the logs to show an error message like the one in this forum about not being able to retrieve a secret.

westrachel commented 10 months ago

Okay, I never got a response to the AWS support ticket I opened. However, I realized that I could enable logs for the db instances in addition to all the other (enhanced) logs I already had enabled for other components. Looking at the db instance logs, I can see errors related to the init_query's "SET x=1, y=2". I had that in the configuration of the aws_db_proxy_default_target_group resource because I never adjusted it from the example in the AWS Terraform docs. I technically don't need it, and that attribute is optional. After removing it, the Lambda function is able to connect to the RDS db proxy created through Terraform and execute SQL statements successfully. The working target group is shown below.
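
For anyone who hits the same thing, the working configuration is just the original target group with init_query removed:

resource "aws_db_proxy_default_target_group" "yeti_cluster_proxy" {
  db_proxy_name = aws_db_proxy.yeti_proxy.name

  connection_pool_config {
    connection_borrow_timeout    = 120
    max_connections_percent      = 100
    max_idle_connections_percent = 50
  }
}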

github-actions[bot] commented 9 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.