hashicorp / terraform-aws-consul-ecs

Consul Service Mesh on AWS ECS (Elastic Container Service)
https://www.consul.io/docs/ecs
Mozilla Public License 2.0

grpc tls issue with ecs controller 0.8.0 #303

Closed loungerider closed 6 months ago

loungerider commented 6 months ago

Hello all, we are seeing the following TLS communication issue when deploying the ECS controller with TLS enabled.

Consul server version:

Consul v1.17.0
Revision 4e3f428b
Build Date 2023-11-03T14:56:56Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

ECS controller image:

variable "consul_ecs_image" {
  description = "Consul ECS image to use in all tasks."
  type        = string
  default     = "hashicorp/consul-ecs:0.8.0"
}

The ECS controller is running as a Fargate task, and we see the following error in its logs:

[ERROR] connection error: error="fetching supported dataplane features: rpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: remote error: tls: bad certificate\""

When running consul monitor with trace logging on the server, we see the corresponding server-side error:

2024-03-20T20:37:12.469Z [TRACE] agent: [core][Server #2] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "ServerHandshake(\"10.255.186.202:46110\") failed: tls: client didn't provide a certificate"

We are passing the following to the controller

  source = "hashicorp/consul-ecs/aws//modules/controller"
  version = "0.8.0"

  name_prefix         = var.name
  ecs_cluster_arn     = var.ecs_cluster_arn
  region              = var.region
  subnets             = var.private_subnets
  consul_server_hosts = var.consul_server_hosts
  consul_ca_cert_arn  = var.consul_ca_cert_arn
  launch_type         = "FARGATE"

  consul_bootstrap_token_secret_arn = var.consul_server_bootstrap_token_arn

  log_configuration = {
    logDriver = "awslogs"
    options = {
      awslogs-group         = var.log_group_name
      awslogs-region        = var.region
      awslogs-stream-prefix = "consul-controller"
    }
  }

  consul_ecs_image = var.consul_ecs_image
  tls              = true
}

Are we missing something on the client side configuration?

Ganeshrockz commented 6 months ago

👋 @loungerider The configuration looks correct to me. Can you verify from the ECS UI if the CONSUL_GRPC_CACERT_PEM environment variable is populated correctly?

loungerider commented 6 months ago

Hi @Ganeshrockz yes I can confirm that CONSUL_GRPC_CACERT_PEM is set correctly.

We did some digging: we looked at the example posted here https://github.com/hashicorp/terraform-aws-consul-ecs/tree/main/examples/locality-aware-routing and traced the dev server config back to the module. We noticed that verify_incoming is only set under tls internal_rpc there, not under tls defaults. Our server config had verify_incoming set under tls defaults. After moving that setting from tls defaults to internal_rpc, the controller can connect.

https://github.com/hashicorp/terraform-aws-consul-ecs/blob/v0.8.0/modules/dev-server/main.tf#L373-L375
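
For comparison with the working config further down, the earlier non-working tls stanza would have looked roughly like the sketch below. This is a reconstruction from the description above, not the original file. It assumes that only the position of verify_incoming differed, and that verify_server_hostname was already under internal_rpc.

```json
{
  "tls": {
    "defaults": {
      "verify_incoming": true,
      "verify_outgoing": true,
      "ca_file": "/consul/tls/certs/consul-agent-ca.pem",
      "cert_file": "/consul/tls/certs/server-consul-0.pem",
      "key_file": "/consul/tls/certs/server-consul-0-key.pem"
    },
    "internal_rpc": {
      "verify_server_hostname": true
    }
  }
}
```

With verify_incoming under defaults, it also applies to the grpc_tls listener. That is consistent with the server-side "client didn't provide a certificate" error, since the controller only presents a CA certificate.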

We thought that the server side auto_encrypt setting would automatically set the client side cert for the controller. Do you think this is a bug or does using tls defaults verify_incoming require client side certs that auto_encrypt can't provide?

This is what our working consul agent config looks like:

{
  "tls": {
    "defaults": {
      "verify_outgoing": true,
      "ca_file": "/consul/tls/certs/consul-agent-ca.pem",
      "cert_file": "/consul/tls/certs/server-consul-0.pem",
      "key_file": "/consul/tls/certs/server-consul-0-key.pem"
    },
    "internal_rpc": {
      "verify_incoming": true,
      "verify_server_hostname": true
    }
  },
  "encrypt": "${gossip_key}",
  "primary_datacenter": "${primary_datacenter}",
  "connect": {
    "enabled": true
  },
  "auto_encrypt": {
    "allow_tls": true
  },
  "ports": {
    "http": 8500,
    "https": 8501,
    "grpc_tls": 8503
  }
}

Ganeshrockz commented 6 months ago

> We thought that the server side auto_encrypt setting would automatically set the client side cert for the controller.

This isn't a bug; it is expected behavior. Setting auto_encrypt on the server doesn't affect the controller's configuration, because the controller is independent of it (unlike the v0.6.x architecture, there is no client agent in the ECS task).
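
If verify_incoming needs to stay in tls defaults for the other listeners, one possible alternative is to override it for the gRPC listener only. The sketch below is untested and assumes the per-listener tls.grpc stanza available in recent Consul versions. The file paths are copied from the working config above.

```json
{
  "tls": {
    "defaults": {
      "verify_incoming": true,
      "verify_outgoing": true,
      "ca_file": "/consul/tls/certs/consul-agent-ca.pem",
      "cert_file": "/consul/tls/certs/server-consul-0.pem",
      "key_file": "/consul/tls/certs/server-consul-0-key.pem"
    },
    "grpc": {
      "verify_incoming": false
    },
    "internal_rpc": {
      "verify_server_hostname": true
    }
  }
}
```

In this layout the gRPC TLS port would still be encrypted but would not require a client certificate, so the controller would authenticate with its ACL token alone.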

loungerider commented 6 months ago

Great thanks