hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

azurerm_kubernetes_cluster does not create an aks cluster with the apiserver authorized api ranges feature #9604

Closed: phcaguiar closed this issue 2 years ago

phcaguiar commented 3 years ago

Terraform (and AzureRM Provider) Version

AzureRM Provider v2.34.0 (as pinned in the provider block below).

Affected Resource(s)

azurerm_kubernetes_cluster

Terraform Configuration Files


data "azurerm_subnet" "subnet" {
  resource_group_name  = var.subnet_resource_group_name
  virtual_network_name = var.virtual_network_name
  name                 = var.subnet_name
}

resource "azurerm_kubernetes_cluster" "kubernetes_cluster" {
  name                = "aks-${lower(var.kubernetes_cluster_name)}"
  location            = var.location
  dns_prefix          = "api-aks-${lower(var.kubernetes_cluster_name)}"
  resource_group_name = var.resource_group_name
  kubernetes_version  = var.kubernetes_version
  default_node_pool {
    name                = var.default_node_pool_name
    node_count          = var.default_node_pool_node_count
    vm_size             = var.default_node_pool_vm_size
    max_pods            = var.default_node_pool_max_pods
    os_disk_size_gb     = var.default_node_pool_os_disk_size_gb
    min_count           = var.default_node_pool_min_count
    max_count           = var.default_node_pool_max_count
    enable_auto_scaling = var.default_node_pool_enable_auto_scaling
    type                = var.default_node_pool_type
    vnet_subnet_id      = data.azurerm_subnet.subnet.id
  }
  role_based_access_control {
    enabled = var.role_based_access_control_enabled
    azure_active_directory {
      client_app_id     = var.client_app_id
      server_app_id     = var.server_app_id
      server_app_secret = var.server_app_secret
      tenant_id         = var.tenant_id
      managed           = false
    }
  }
  api_server_authorized_ip_ranges = var.api_server_authorized_ip_ranges #[]
  service_principal {
    client_id     = var.service_principal_client_id
    client_secret = var.service_principal_client_secret
  }
  network_profile {
    network_plugin     = var.network_plugin
    docker_bridge_cidr = var.docker_bridge_cidr
    service_cidr       = var.service_cidr
    dns_service_ip     = var.dns_service_ip
    load_balancer_sku  = var.load_balancer_sku
    network_policy     = var.network_policy
  }
}

variable "subnet_resource_group_name" {
  description = "Specifies the name of the resource group the Virtual Network is located in."
}

variable "virtual_network_name" {
  description = "Specifies the name of the Virtual Network this Subnet is located within."
}

variable "subnet_name" {
  description = "Specifies the name of the Subnet."
}

variable "location" {
  description = "(Required) The location where the Managed Kubernetes Cluster should be created. Changing this forces a new resource to be created."
  default     = "eastus2"
}

variable "kubernetes_cluster_name" {
  description = "(Required) The name of the Managed Kubernetes Cluster to create. Changing this forces a new resource to be created."
}

variable "resource_group_name" {
  description = "(Required) The name of the resource group in which the Log Analytics workspace is created."
}

variable "kubernetes_version" {
  description = "(Required) Version of Kubernetes specified when creating the AKS managed cluster. If not specified, the latest recommended version will be used at provisioning time (but won't auto-upgrade)."
}

variable "default_node_pool_name" {
  description = "(Optional) The name which should be used for the default Kubernetes Node Pool. Changing this forces a new resource to be created."
  default     = "default"
}

variable "default_node_pool_node_count" {
  description = "(Required if default_node_pool_enable_auto_scaling variable is set to false). The initial number of nodes which should exist in this Node Pool. If specified this must be between 1 and 100."
  default     = null
}

variable "default_node_pool_vm_size" {
  description = "(Optional) The size of the Virtual Machine, such as Standard_DS2_v2."
  default     = "Standard_F8s_v2"
}

variable "default_node_pool_max_pods" {
  description = "(Optional) The maximum number of pods that can run on each agent. Changing this forces a new resource to be created."
  default     = "110"
}

variable "default_node_pool_os_disk_size_gb" {
  description = "(Optional) The size of the OS Disk which should be used for each agent in the Node Pool. Changing this forces a new resource to be created."
  default     = "40"
}

variable "default_node_pool_min_count" {
  description = "(Required if default_node_pool_enable_auto_scaling variable is set to true). The minimum number of nodes which should exist in this Node Pool. If specified this must be between 1 and 100."
  default     = null
}

variable "default_node_pool_max_count" {
  description = "(Required if default_node_pool_enable_auto_scaling variable is set to true) The maximum number of nodes which should exist in this Node Pool. If specified this must be between 1 and 100."
  default     = null
}

variable "default_node_pool_enable_auto_scaling" {
  description = "(Optional) Should the Kubernetes Auto Scaler be enabled for this Node Pool? Defaults to true."
  default     = "true"
}

variable "default_node_pool_type" {
  description = "(Optional) The type of Node Pool which should be created. Possible values are AvailabilitySet and VirtualMachineScaleSets. Defaults to VirtualMachineScaleSets."
  default     = "VirtualMachineScaleSets"
}

variable "role_based_access_control_enabled" {
  description = "(Optional) A role_based_access_control block. Changing this forces a new resource to be created."
  default     = "true"
}

variable "client_app_id" {
  description = "(Required) The Client ID of an Azure Active Directory Application."
}

variable "server_app_id" {
  description = "(Required) The Server ID of an Azure Active Directory Application."
}

variable "server_app_secret" {
  description = "(Required) The Server Secret of an Azure Active Directory Application."
}

variable "tenant_id" {
  description = "(Required) The Tenant ID used for Azure Active Directory Application. If this isn't specified the Tenant ID of the current Subscription is used."
}

variable "api_server_authorized_ip_ranges" {
  description = "(Optional) The IP ranges to whitelist for incoming traffic to the masters."
  default = [
    "201.17.87.61/32"
    ]
}

variable "addon_profile_oms_agent_enabled" {
  description = "(Required) Is the OMS Agent Enabled?"
  default     = "true"
}

variable "service_principal_client_id" {
  description = "(Required) The Client ID for the Service Principal."
}

variable "service_principal_client_secret" {
  description = "(Required) The Client Secret for the Service Principal."
}

variable "network_plugin" {
  description = "(Optional) Network plugin to use for networking. Currently supported values are azure and kubenet. Changing this forces a new resource to be created."
  default     = "azure"
}

variable "docker_bridge_cidr" {
  description = "(Optional) IP address (in CIDR notation) used as the Docker bridge IP address on nodes. This is required when network_plugin is set to azure. Changing this forces a new resource to be created."
  default     = "192.168.128.1/17"
}

variable "service_cidr" {
  description = "(Optional) The Network Range used by the Kubernetes service. This is required when network_plugin is set to azure. Changing this forces a new resource to be created."
  default     = "192.168.0.0/17"
}

variable "dns_service_ip" {
  description = "(Optional) IP address within the Kubernetes service address range that will be used by cluster service discovery (kube-dns). This is required when network_plugin is set to azure. Changing this forces a new resource to be created."
  default     = "192.168.0.10"
}

variable "load_balancer_sku" {
  description = "(Optional) Specifies the SKU of the Load Balancer used for this Kubernetes Cluster. Possible values are Basic and Standard. Defaults to Standard."
  default     = "Standard"
}

variable "network_policy" {
  description = "(Optional) Network plugin to use for networking. Currently supported values are azure and kubenet. Changing this forces a new resource to be created."
  default     = "azure"
}

provider "azurerm" {
  subscription_id = local.env["stg"].subscription_id
  version         = "2.34.0"
  features {}
}

module "aks" {
  source                                    = "./modules/aks"
  resource_group_name                       = local.env["stg"].resource_group_name
  kubernetes_cluster_name                   = local.env["stg"].aks_kubernetes_cluster_name
  subnet_resource_group_name                = local.env["stg"].aks_subnet_resource_group_name
  virtual_network_name                      = local.env["stg"].virtual_network_name
  subnet_name                               = local.env["stg"].aks_subnet_name
  location                                  = local.env["stg"].location
  client_app_id                             = local.env["stg"].client_app_id
  server_app_id                             = local.env["stg"].server_app_id
  tenant_id                                 = local.env["stg"].tenant_id
  server_app_secret                         = local.env["stg"].server_app_secret
  service_principal_client_id               = local.env["stg"].service_principal_client_id
  service_principal_client_secret           = local.env["stg"].service_principal_client_secret
  kubernetes_version                        = local.env["stg"].aks_kubernetes_version
  default_node_pool_min_count               = local.env["stg"].aks_default_node_pool_min_count
  default_node_pool_max_count               = local.env["stg"].aks_default_node_pool_max_count
  default_node_pool_max_pods                = local.env["stg"].aks_default_node_pool_max_pods
}

locals {
  env = {
    stg = {
      subscription_id                 = var.subscription_id
      resource_group_name             = "my-rg-name"
      aks_kubernetes_cluster_name     = "test"
      aks_subnet_resource_group_name  = "my-subnet-rg-name"
      virtual_network_name            = "my-vnet-name"
      aks_subnet_name                 = "my-aks-subnet-name"
      location                        = "eastus2"
      client_app_id                   = var.client_app_id
      server_app_id                   = var.server_app_id
      tenant_id                       = var.tenant_id
      service_principal_client_id     = var.service_principal_client_id
      aks_kubernetes_version          = "1.16.13"
      aks_default_node_pool_min_count = "1"
      aks_default_node_pool_max_count = "2"
      aks_default_node_pool_max_pods  = "31"
      server_app_secret               = var.server_app_secret
      service_principal_client_secret = var.service_principal_client_secret
    }
  }
}

Debug Output

https://gist.github.com/phcaguiar/91a1d3b0bf230f2f832847614991eabb

Panic Output

No panic

Expected Behaviour

The cluster should be created with the given IP whitelist applied via the API server authorized IP ranges feature, without terraform apply returning an error.

Actual Behaviour

Terraform creates the cluster with the whitelist applied, but returns an unexpected error. The whitelist can only be applied without an error as a second step, after the cluster has first been created without this feature. The error received:

Error: waiting for creation of Managed Kubernetes Cluster "aks-test" (Resource Group "FinancialSystems-Common-EC2-DEV"): Code="CreateVMSSAgentPoolFailed" Message="Unable to establish connection from agents to Kubernetes API server, please see https://aka.ms/aks-required-ports-and-addresses for more information. Details: VMSSAgentPoolReconciler retry failed: deployment operations failed with error messages: {\n \"code\": \"VMExtensionProvisioningError\",\n \"message\": \"VM has reported a failure when processing extension 'vmssCSE'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=51\\n[stdout]\\nMon Nov 30 21:11:56 UTC 2020,aks-default-64087038-vmss000000\\nConnection to mcr.microsoft.com 443 port [tcp/https] succeeded!\\n\\n[stderr]\\n\\\"\\r\\n\\r\\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot \",\n \"details\": [\n {\n \"code\": \"VMExtensionProvisioningError\",\n \"message\": \"VM has reported a failure when processing extension 'vmssCSE'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=51\\n[stdout]\\nMon Nov 30 21:11:56 UTC 2020,aks-default-64087038-vmss000000\\nConnection to mcr.microsoft.com 443 port [tcp/https] succeeded!\\n\\n[stderr]\\n\\\"\\r\\n\\r\\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot \"\n }\n ]\n } "
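
For what it's worth, one way to confirm the ranges really were applied despite the error (a standard Azure CLI query; resource group and cluster name taken from the error above):

az aks show \
  --resource-group FinancialSystems-Common-EC2-DEV \
  --name aks-test \
  --query apiServerAccessProfile.authorizedIpRanges \
  --output json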

As a workaround, to create the cluster with this feature applied at the same time as its creation, I am using a null_resource:

resource "null_resource" "main" {
   triggers = {
     api_server_authorized_ip_ranges = join (",", var.api_server_authorized_ip_ranges)
   }
   provisioner "local-exec" {
     command = "az aks update --resource-group $ {azurerm_kubernetes_cluster.kubernetes_cluster.resource_group_name} --name $ {azurerm_kubernetes_cluster.kubernetes_cluster.name} --api-server-authorized-ip-ranges $ {join (", ", var .api_server_authorized_ip_ranges)} "
   }
}

Steps to Reproduce

  1. terraform apply - with the api_server_authorized_ip_ranges feature
  2. terraform apply - without the api_server_authorized_ip_ranges feature
  3. terraform apply - with the null_resource configuration suggestion

Important Factoids

When creating a similar cluster with the az aks create command, the cluster is created without errors, so this looks like a Terraform provider bug. Here is an example of the command used to create an AKS cluster with az aks create:

az aks create \
  --resource-group FinancialSystems-Common-EC2-DEV \
  --name aks-test \
  --node-count 1 \
  --vm-set-type VirtualMachineScaleSets \
  --load-balancer-sku standard \
  --api-server-authorized-ip-ranges "201.17.87.61/32" \
  --network-plugin azure \
  --network-policy azure \
  --docker-bridge-address 192.168.128.1/17 \
  --dns-service-ip 192.168.0.10 \
  --service-cidr 192.168.0.0/17 \
  --generate-ssh-keys

References

Parameter and resource terraform used to create the cluster: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster#api_server_authorized_ip_ranges

ManyaSinghal commented 3 years ago

I am also facing the same kind of error, even though all the necessary ports are already open:

Error: waiting for creation of Managed Kubernetes Cluster "test-eastus2-poc7-k802" (Resource Group "test-eastus2-poc7-rg01"): Code="CreateVMSSAgentPoolFailed" Message="Unable to establish outbound connection from agents, please see https://aka.ms/aks-required-ports-and-addresses for more information. Details: VMSSAgentPoolReconciler retry failed: deployment operations failed with error messages: {\n \"code\": \"VMExtensionProvisioningError\",\n \"message\": \"VM has reported a failure when processing extension 'vmssCSE'. Error message: \\"Enable failed: failed to execute command: command terminated with exit status=50\n[stdout]\n\n[stderr]\n\\"\r\n\r\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot \",\n \"details\": [\n {\n \"code\": \"VMExtensionProvisioningError\",\n \"message\": \"VM has reported a failure when processing extension 'vmssCSE'. Error message: \\"Enable failed: failed to execute command: command terminated with exit status=50\n[stdout]\n\n[stderr]\n\\"\r\n\r\nMore information on troubleshooting is available at

on ../TerraformModules/Kubernetes/main.tf line 68, in resource "azurerm_kubernetes_cluster" "this":
  68: resource "azurerm_kubernetes_cluster" "this" {

kriesto commented 3 years ago

Hello Everyone,

For those who are still facing this error while using a Terraform AKS module: Code="CreateVMSSAgentPoolFailed" Message="AKS encountered an internal error while attempting the requested Creating operation. AKS will continuously retry the requested operation until successful or a retry timeout is hit. Check back to see if the operation requires resubmission."

Try setting the enable_host_encryption value to false and try again.
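
For reference, a minimal sketch of where that argument sits (argument and block names per the azurerm 2.x provider docs; all other names and values here are placeholders, and enable_host_encryption may require a newer provider version than the 2.34.0 pinned above):

# Minimal illustrative cluster definition; names and values are placeholders.
resource "azurerm_kubernetes_cluster" "example" {
  name                = "aks-example"
  location            = "eastus2"
  resource_group_name = "example-rg"
  dns_prefix          = "aks-example"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_F8s_v2"

    # enable_host_encryption defaults to false; setting it explicitly rules it
    # out as a trigger for CreateVMSSAgentPoolFailed.
    enable_host_encryption = false
  }

  identity {
    type = "SystemAssigned"
  }
}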

phcaguiar commented 2 years ago

@kriesto this parameter already receives the value "false" by default and this is already applied in our environment.

stephybun commented 2 years ago

Hi @phcaguiar,

Thanks for raising this issue and apologies for the wait. From the issue description my understanding is that the cluster creation with the whitelist is actually successful despite receiving an error from the AKS API, but since Terraform believes the deployment failed due to the error, the resource does not end up in the state and so the cluster cannot be managed thereafter. Does that summary sound correct? If so then I believe this is related to #9342 for which there is already an upstream issue https://github.com/Azure/AKS/issues/1972.
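
One possible way to recover a cluster that was created in Azure but is missing from Terraform state (not something the provider does automatically; the subscription ID below is a placeholder, and the module and resource addresses match the configuration above) is to import it and then re-run plan:

# Import the existing cluster into state so Terraform can manage it again.
# Replace the subscription ID with your own; resource group and name are from the error output above.
terraform import \
  module.aks.azurerm_kubernetes_cluster.kubernetes_cluster \
  "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/FinancialSystems-Common-EC2-DEV/providers/Microsoft.ContainerService/managedClusters/aks-test"

# Then confirm the configuration and the imported state agree.
terraform plan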

stephybun commented 2 years ago

Closing since we haven't heard back and because the cause of this issue is likely the same as what's described in #9342. Please subscribe to #9342 for updates. Thanks!

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.