hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Azurerm_linux_web_app deploying updated docker images results in app service being unhealthy on one instance for at least two hours #26870

Open Marc-fra76 opened 1 month ago

Marc-fra76 commented 1 month ago

Terraform Version

1.3.3

AzureRM Provider Version

3.97.1

Affected Resource(s)/Data Source(s)

azurerm_linux_web_app

Terraform Configuration Files

locals {
  pe_tags = { "Kind" = "Private EndPoint" }
}

# Create the App Service (Linux)
resource "azurerm_linux_web_app" "this" {
  name                = "${var.prefix}-${var.name}"
  service_plan_id     = var.service_plan_id
  resource_group_name = var.resource_group_name
  location            = var.location

  virtual_network_subnet_id = var.virtual_network_subnet_id
  https_only                = var.https_only

  site_config {
    always_on                         = var.always_on
    vnet_route_all_enabled            = var.vnet_route_all_enabled
    minimum_tls_version               = var.minimum_tls_version
    ftps_state                        = var.ftps_state
    http2_enabled                     = var.http2_enabled
    ip_restriction_default_action     = var.ip_restriction_default_action
    scm_ip_restriction_default_action = var.scm_ip_restriction_default_action

    application_stack {
      docker_image_name        = "${var.docker_image}:${var.docker_image_tag}"
      dotnet_version           = var.dotnet_version
      docker_registry_url      = var.docker_registry_url
      docker_registry_username = var.docker_registry_username
      docker_registry_password = var.docker_registry_password
    }

    cors {
      allowed_origins = var.allowed_origins
    }

    health_check_path                 = var.health_check_path
    health_check_eviction_time_in_min = var.health_check_path != "" ? var.health_check_eviction_time_in_min : null
    app_command_line                  = var.app_command_line

    dynamic "ip_restriction" {
      for_each = var.ip_restrictions
      content {
        action                    = can(ip_restriction.value["action"]) ? ip_restriction.value["action"] : null
        ip_address                = can(ip_restriction.value["ip_address"]) ? ip_restriction.value["ip_address"] : null
        name                      = can(ip_restriction.value["name"]) ? ip_restriction.value["name"] : null
        priority                  = can(ip_restriction.value["priority"]) ? ip_restriction.value["priority"] : null
        service_tag               = can(ip_restriction.value["service_tag"]) ? ip_restriction.value["service_tag"] : null
        virtual_network_subnet_id = can(ip_restriction.value["virtual_network_subnet_id"]) ? ip_restriction.value["virtual_network_subnet_id"] : null
        dynamic "headers" {
          for_each = ip_restriction.value["headers"] == null ? [] : [1]
          content {
            x_azure_fdid      = can(ip_restriction.value["headers"].x_azure_fdid) ? ip_restriction.value["headers"].x_azure_fdid : null
            x_fd_health_probe = can(ip_restriction.value["headers"].x_fd_health_probe) ? ip_restriction.value["headers"].x_fd_health_probe : null
            x_forwarded_for   = can(ip_restriction.value["headers"].x_forwarded_for) ? ip_restriction.value["headers"].x_forwarded_for : null
            x_forwarded_host  = can(ip_restriction.value["headers"].x_forwarded_host) ? ip_restriction.value["headers"].x_forwarded_host : null
          }
        }
      }
    }

    dynamic "scm_ip_restriction" {
      for_each = var.scm_ip_restrictions
      content {
        action                    = can(scm_ip_restriction.value["action"]) ? scm_ip_restriction.value["action"] : null
        ip_address                = can(scm_ip_restriction.value["ip_address"]) ? scm_ip_restriction.value["ip_address"] : null
        name                      = can(scm_ip_restriction.value["name"]) ? scm_ip_restriction.value["name"] : null
        priority                  = can(scm_ip_restriction.value["priority"]) ? scm_ip_restriction.value["priority"] : null
        service_tag               = can(scm_ip_restriction.value["service_tag"]) ? scm_ip_restriction.value["service_tag"] : null
        virtual_network_subnet_id = can(scm_ip_restriction.value["virtual_network_subnet_id"]) ? scm_ip_restriction.value["virtual_network_subnet_id"] : null
        dynamic "headers" {
          for_each = scm_ip_restriction.value["headers"] == null ? [] : [1]
          content {
            x_azure_fdid      = can(scm_ip_restriction.value["headers"].x_azure_fdid) ? scm_ip_restriction.value["headers"].x_azure_fdid : null
            x_fd_health_probe = can(scm_ip_restriction.value["headers"].x_fd_health_probe) ? scm_ip_restriction.value["headers"].x_fd_health_probe : null
            x_forwarded_for   = can(scm_ip_restriction.value["headers"].x_forwarded_for) ? scm_ip_restriction.value["headers"].x_forwarded_for : null
            x_forwarded_host  = can(scm_ip_restriction.value["headers"].x_forwarded_host) ? scm_ip_restriction.value["headers"].x_forwarded_host : null
          }
        }
      }
    }

    scm_use_main_ip_restriction = var.scm_use_main_ip_restriction
    scm_minimum_tls_version     = var.scm_minimum_tls_version
  }

  dynamic "identity" {
    for_each = var.identities

    content {
      type         = identity.key
      identity_ids = identity.value
    }
  }

  dynamic "storage_account" {
    for_each = var.storage_accounts
    content {
      name         = storage_account.value["name"]
      type         = storage_account.value["type"]
      share_name   = storage_account.value["share_name"]
      account_name = storage_account.value["account_name"]
      access_key   = storage_account.value["access_key"]
      mount_path   = storage_account.value["mount_path"]
    }
  }

  logs {
    http_logs {
      file_system {
        retention_in_days = var.retention_in_days
        retention_in_mb   = var.retention_in_mb
      }
    }
  }

  app_settings = var.app_settings

  tags = var.tags

  lifecycle {
    ignore_changes = [
      site_config[0].cors,
      logs[0].http_logs[0].file_system[0].retention_in_days,
      tags,
    ]
  }
}

# Add a private endpoint for the mytt application
resource "azurerm_private_endpoint" "this" {
  name                = "${var.prefix}-${var.name}-${var.private_endpoint_suffix}"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.pe_subnet_id

  tags = merge(var.tags, local.pe_tags)

  private_dns_zone_group {
    name                 = var.private_dns_zone_group_name
    private_dns_zone_ids = var.private_dns_zone_ids
  }

  private_service_connection {
    name                           = "${var.prefix}-${var.name}-${var.service_connexion_suffix}"
    is_manual_connection           = var.is_manual_connection
    private_connection_resource_id = azurerm_linux_web_app.this.id
    subresource_names              = ["sites"]
  }

  dynamic "ip_configuration" {
    for_each = var.private_ip_address == "" ? {} : { ip_configuration = true }
    content {
      name               = "${var.prefix}-${var.name}-static-ipconfig-${var.service_connexion_suffix}"
      private_ip_address = var.private_ip_address
      subresource_name   = "sites"
    }
  }
}

resource "azurerm_role_assignment" "this" {
  for_each = var.role_assignments
  depends_on = [
    azurerm_linux_web_app.this
  ]
  scope                = each.value
  role_definition_name = each.key
  principal_id         = azurerm_linux_web_app.this.identity[0].principal_id
}

Debug Output/Panic Output

# module.app_service_consumer["app-smouv-gedmouv-client-v1"].azurerm_linux_web_app.this will be updated in-place
  ~ resource "azurerm_linux_web_app" "this" {
        id                                             = "/subscriptions/***/resourceGroups/***/providers/Microsoft.Web/sites/fraprod-app-smouv-gedmouv-client-v1"
        name                                           = "fraprod-app-smouv-gedmouv-client-v1"
      ~ public_network_access_enabled                  = false -> true
        tags                                           = {
            "Kind"                       = "App Service Consumer"
            "ManagedBy"                  = "SOGET Track And Trace Team"
            "Owner"                      = "trackandtrace-team@soget.fr"
            "Project"                    = "Terraform provisionning"
            "environment"                = "API PROD"
            "hidden-link: acrResourceId" = jsonencode(
                {
                    subscriptionId = "****"
                }
            )
        }
        # (24 unchanged attributes hidden)

      ~ site_config {
          ~ health_check_eviction_time_in_min             = 0 -> 10
            # (28 unchanged attributes hidden)

          ~ application_stack {
              ~ docker_image_name        = "gedmouv-client/gedmouv-client:1.5.0-beta.143" -> "gedmouv-client/gedmouv-client:1.5.0-beta.144"
                # (14 unchanged attributes hidden)
            }
        }

        # (1 unchanged block hidden)
    }

Expected Behaviour

The updated Docker image should roll out to all instances of the app service plan used by the app service, with each instance remaining healthy.

Actual Behaviour

After upgrading the azurerm provider from version 3.38.0 to 3.97.1, I run into unstable app services after deployment. I had to increase the app service plan to at least 2 instances in order to keep our app services up.

The deployment itself went fine, but the app service is detected as unhealthy on one instance. This unhealthy state takes around two hours to resolve automatically. During that time, several instances are created (around 10) and each tries to start the application.

Here is the memory usage graph of our last deployment (attached image: memory_usage_per_instance).

Steps to Reproduce

Simply apply a new image update (a changed Docker image tag) to an existing app service.
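In this module an image update is just a variable change; a minimal sketch of the tfvars diff that triggers the in-place update (tag values taken from the plan output above, variable names from the module configuration):

```hcl
# Hypothetical terraform.tfvars change: only the image tag moves between applies
docker_image     = "gedmouv-client/gedmouv-client"
docker_image_tag = "1.5.0-beta.144" # previously 1.5.0-beta.143
```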

Important Factoids

No response

References

No response

Chambras commented 1 month ago

@Marc-fra76 Thanks for submitting this, but it seems this is not an azurerm provider issue. Seems to be an Azure capacity issue. Have you tried using a different region? Sometimes it also helps using a different SKU. The code does not show which application plan you are using.

Marc-fra76 commented 1 month ago

> @Marc-fra76 Thanks for submitting this, but it seems this is not an azurerm provider issue. Seems to be an Azure capacity issue. Have you tried using a different region? Sometimes it also helps using a different SKU. The code does not show which application plan you are using.

This problem appeared after I upgraded the azurerm provider version. I was able to deploy twice with this upgrade on our staging environment before doing it on our production environment. I have also opened an Azure support ticket to try to find out where the problem occurs. Their reply is simple: deploy fewer app services at once, make the deployment more sequential, and possibly split the app services across multiple app service plans.
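For reference, the "more sequential" deployment they suggested could be approximated from the CLI; the module address below is illustrative, taken from the plan output above, and `-target` is intended for exceptional use only:

```shell
# Apply one app service module at a time instead of all of them in a single apply
terraform apply -target='module.app_service_consumer["app-smouv-gedmouv-client-v1"]'

# Or limit concurrent resource operations within one apply
terraform apply -parallelism=1
```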

I'm still using version 3.38.0 on another infrastructure for the moment, and I don't have this problem there. I had planned to upgrade it too, but at the moment I'm not willing to do it.

The app service plan SKU is P2V3, which normally allows up to 32 app services. I have only 20 app services on it, and simply updating 6 of them gives me this result.

I have also added the lifecycle > ignore_changes block to reduce the updates applied to the app services on each plan. I made this change after reading issue https://github.com/hashicorp/terraform-provider-azurerm/issues/22879, but it didn't change much: it just reduced the number of app services being updated from all of them to only the ones actually requested.