Advertising an exit node via --advertise-exit-node stops working on subsequent terraform applications

kim-anchorzero commented 1 month ago

Describe the bug If you advertise a node as an exit node via --advertise-exit-node in a Terraform template that also defines a resource "tailscale_device_subnet_routes", it works on first launch, but the second Terraform run shows a change such as:

  ~ resource "tailscale_device_subnet_routes" "this" {
      ~ device_id = "1278388442592801" -> (known after apply)
        id        = "c7db1330-f7f9-ac6e-75cb-e1bcf9b26f6d"
      ~ routes    = [
          - "0.0.0.0/0",
          - "::/0",
            # (1 unchanged element hidden)
        ]
    }

If applied, the node stops functioning as an exit node and is listed in the UI as requiring approval. Once that approval is granted, it functions as an exit node again.

To Reproduce Steps to reproduce the behaviour:

1. In Terraform, create a node whose TAILSCALE_UP_ARGS contains --advertise-exit-node.
2. Include a tailscale_device_subnet_routes resource that defines some block of routes but not the exit node routes.
3. Run Terraform, resulting in a functioning exit node.
4. Run Terraform again, resulting in the exit node routes being removed from the node.
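
A minimal sketch of the resource described in the second step, assuming the device is looked up by the hostname "device-name" (the same name used in the template further down) and that the node itself was started with --advertise-exit-node. The routes list deliberately leaves out the exit node routes:

```hcl
# Look up the device once it has joined the tailnet.
data "tailscale_device" "this" {
  name     = "device-name"
  wait_for = "60s"
}

resource "tailscale_device_subnet_routes" "this" {
  device_id = data.tailscale_device.this.id

  # Only the subnet routes are listed. The exit node routes
  # ("0.0.0.0/0" and "::/0") are absent, so once they have been enabled
  # outside Terraform, the next plan proposes removing them.
  routes = [
    "172.16.1.0/24",
    "172.16.2.0/24",
  ]
}
```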

Expected behaviour The exit node keeps functioning across subsequent Terraform runs.

Desktop (please complete the following information):

OS: Mac, Linux
Terraform Version 1.6.7
Provider Version [e.g. 0.15.0]

Additional context The following (untested, likely incomplete) template should demonstrate the issue:

module "vpc" {
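  # Plain Terraform takes a registry address plus a separate version argument
  # (the tfr:// form is Terragrunt syntax).
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"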

  name = "vpc"

  cidr = "172.16.0.0/16"
  azs = [
    "us-east-1a",
    "us-east-1b"
  ]
  public_subnets  = []
  private_subnets = ["172.16.1.0/24", "172.16.2.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false

  enable_vpn_gateway = false

  enable_dns_hostnames = true
  enable_dns_support   = true

  enable_flow_log                      = true
  create_flow_log_cloudwatch_log_group = true
  create_flow_log_cloudwatch_iam_role  = true
  flow_log_max_aggregation_interval    = 60
}

resource "aws_ssm_parameter" "state" {
  name  = "/state"
  type  = "SecureString"
  value = "{}"

  lifecycle {
    ignore_changes = [
      value
    ]
  }
}

data "aws_vpc" "default" {
  id = module.vpc.vpc_id
}

resource "aws_ecs_task_definition" "default" {
  family = "family"

  container_definitions = jsonencode([
    {
      name      = "tailscale"
      image     = "someimage"
      essential = true
      linuxParameters = {
        initProcessEnabled = true
      }
      environment = [
        {
          name  = "TAILSCALE_AUTHKEY"
          value = "key"
        },
        {
          name  = "TAILSCALE_STATE_PARAMETER_ARN"
          # The variable name asks for an ARN, so pass the parameter's ARN rather than its value.
          value = aws_ssm_parameter.state.arn
        },
        {
          name  = "TAILSCALE_UP_ARGS"
          # The VPC module exposes the CIDR as the vpc_cidr_block output.
          value = "--hostname=device-name --advertise-routes ${module.vpc.vpc_cidr_block} --advertise-exit-node"
        }
      ]
    }
  ])
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"

  cpu    = var.instance_cpu
  memory = var.instance_memory

  execution_role_arn = aws_iam_role.execution.arn
  task_role_arn      = aws_iam_role.task.arn

  tags = {
    Documentation = "Tailscale agent task"
  }
}

resource "aws_security_group" "this" {
  name   = "tailscale"
  vpc_id = data.aws_vpc.default.id

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }
}

resource "aws_ecs_service" "default" {
  name            = "service"
  cluster         = "cluster"
  task_definition = aws_ecs_task_definition.default.arn
  launch_type     = "FARGATE"

  desired_count                      = 1
  deployment_minimum_healthy_percent = 0
  deployment_maximum_percent         = 100
  enable_execute_command             = true

  network_configuration {
    assign_public_ip = true

    # subnets expects subnet IDs, not CIDR strings; use the IDs the VPC module outputs.
    subnets = module.vpc.private_subnets

    security_groups = concat([aws_security_group.this.id], var.additional_security_groups)
  }
}

data "tailscale_device" "this" {
  name       = "device-name"
  wait_for   = "60s"
  depends_on = [aws_ecs_service.default]
}

resource "tailscale_device_key" "this" {
  device_id = data.tailscale_device.this.id
}

resource "tailscale_device_subnet_routes" "this" {
  device_id = data.tailscale_device.this.id
  routes = [
    "172.16.1.0/24",
    "172.16.2.0/24",

    # Without the following two entries, subsequent Terraform runs plan to remove the exit node routes.
    # If that plan is applied, the exit node stops working.
    "0.0.0.0/0",
    "::/0"
  ]
}

Note that if you explicitly add the exit node routes to the tailscale_device_subnet_routes block, this issue doesn't happen. This lines up with the documentation for https://registry.terraform.io/providers/tailscale/tailscale/latest/docs/resources/device_subnet_routes, which suggests these routes are how you should advertise an exit node. It's just somewhat surprising that doing so via the CLI args can have these unpredictable results.

mpminardi commented 1 month ago

Thank you for reporting this @kim-anchorzero!

As a clarifying question: in your example above, after applying the configuration the first time, are you also enabling the exit node from the admin console, or is that something you only see and have to do after the second apply?

Having a device act as an exit node or subnet router is a two-step process. It requires both advertising the routes, which is done exclusively via the CLI, and enabling the advertised routes, which can be done in a number of ways, including the tailscale_device_subnet_routes resource or the admin console as mentioned above.

The tailscale_device_subnet_routes resource deals exclusively with enabling routes and must be used in conjunction with the --advertise-exit-node or --advertise-routes flags. Enabling routes and exit nodes via the admin console (or autoApprovers, if configured) enables them outside of Terraform state and causes the drift that you are seeing.

Our documentation for this resource is definitely sparse. I'll look at clarifying the points above and specifying more clearly what this resource actually does.

kim-anchorzero commented 1 month ago

We have auto approvers set up for the device's tags, and it works properly (automatically) on the first attempt. Additionally, if I use the tailscale_device_subnet_routes workaround mentioned above, it never loses permission.
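
For readers following along, an autoApprovers setup of this kind might look roughly like the sketch below. The tag name tag:exit-node, the tag owner, and managing the policy through a tailscale_acl resource are assumptions for illustration, not the configuration actually used here.

```hcl
# Hypothetical policy fragment; the tag name and owner are placeholders.
# A real policy would also carry its access rules, so this is not a
# complete policy file and should not be applied as-is.
resource "tailscale_acl" "this" {
  acl = jsonencode({
    tagOwners = {
      "tag:exit-node" = ["autogroup:admin"]
    }
    autoApprovers = {
      # Subnet routes advertised by devices with this tag are enabled automatically.
      routes = {
        "172.16.0.0/16" = ["tag:exit-node"]
      }
      # Exit node advertisements from devices with this tag are enabled automatically.
      exitNode = ["tag:exit-node"]
    }
  })
}
```

Because the approval happens in the policy rather than in Terraform state, the enabled exit node routes appear as drift to a tailscale_device_subnet_routes resource that does not list them.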

mpminardi commented 1 month ago

Gotcha! This is definitely a rough edge when using auto approvers in the ACL in combination with the tailscale_device_subnet_routes resource, as I think the two contend over tracking the state of the enabled routes.

Adding the "0.0.0.0/0" and "::/0" routes to the tailscale_device_subnet_routes resource is the correct path forward from the Terraform perspective: it prevents both this drift and the permission loss.
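
One way to keep the two kinds of routes readable is to hold them in separate locals and concatenate them. This is a sketch only: the local names are hypothetical, and the device lookup is assumed to match the one in the template above.

```hcl
locals {
  # Subnets the node routes for (values taken from the template above).
  subnet_routes = ["172.16.1.0/24", "172.16.2.0/24"]

  # The IPv4 and IPv6 default routes that represent the exit node.
  exit_node_routes = ["0.0.0.0/0", "::/0"]
}

data "tailscale_device" "this" {
  name     = "device-name"
  wait_for = "60s"
}

resource "tailscale_device_subnet_routes" "this" {
  device_id = data.tailscale_device.this.id

  # Listing both sets keeps Terraform state aligned with what is enabled,
  # so later plans no longer propose removing the exit node routes.
  routes = concat(local.subnet_routes, local.exit_node_routes)
}
```

The node still has to advertise these routes itself with --advertise-routes and --advertise-exit-node; the resource only enables what is advertised.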

mpminardi commented 15 hours ago

Hey @kim-anchorzero, apologies for the long tail on improving the documentation for this! We've released v0.17.0 of the Terraform provider, which has a (hopefully) clearer explanation of the usage of this resource and its gotchas (see here).

We've also updated the API documentation for the associated endpoint to hopefully be clearer.

tailscale / terraform-provider-tailscale, issue #386