microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps

Managed identity image pull failure on replica restart 24 hours after initial start #960

Open maskati opened 11 months ago

maskati commented 11 months ago


Issue description

ACA with a (user-assigned) managed identity authorized with AcrPull on ACR. Initial container startup is fine, and restarts also work if the replica crashes and restarts within a day. If the replica crashes and restarts 24 hours after the initial start, the restart fails with an image pull error.

Microsoft.App/containerApps/revisions state (screenshot not preserved).

ContainerAppSystemLogs show repeating logs every 5 minutes (screenshot not preserved).

In the ACR ContainerRegistryRepositoryEvents table, no Pull operations are logged as part of the restart.

After restarting the revision, things start working again, and this can also be seen as a successful image pull in ACR ContainerRegistryRepositoryEvents.

My assumption is that ACA authenticates using MSI and acquires a token for ACR authentication that is valid for 24 hours, but does not renew the token after the 24 hours have passed. Restarting the revision forces reacquisition of this token, and the image pull works again.
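
For reference, the token flow this theory involves can be checked by hand from inside a replica. Below is a minimal sketch, assuming the azure-identity and requests Python packages and ACR's documented oauth2/exchange endpoint; the registry name and client ID are placeholders, and the managed identity (IMDS) endpoint is only reachable from an Azure host such as the container app itself:

import base64
import json
import requests
from azure.identity import ManagedIdentityCredential

REGISTRY = "myregistry.azurecr.io"  # placeholder registry name

# 1. Acquire an AAD token with the user-assigned managed identity.
cred = ManagedIdentityCredential(client_id="<user-assigned-client-id>")
aad_token = cred.get_token("https://management.azure.com/.default").token

# 2. Exchange the AAD token for an ACR refresh token, as a registry
#    client does before pulling an image.
resp = requests.post(
    f"https://{REGISTRY}/oauth2/exchange",
    data={
        "grant_type": "access_token",
        "service": REGISTRY,
        "access_token": aad_token,
    },
    timeout=30,
)
resp.raise_for_status()
acr_refresh_token = resp.json()["refresh_token"]

# 3. The refresh token is a JWT; decode its payload (no signature check,
#    inspection only) to read the 'exp' claim, i.e. when a cached copy
#    stops working unless the platform renews it.
payload = acr_refresh_token.split(".")[1]
payload += "=" * (-len(payload) % 4)
claims = json.loads(base64.urlsafe_b64decode(payload))
print("ACR refresh token expires at (unix time):", claims["exp"])

Comparing that expiry with the observed time-to-failure would confirm or rule out the stale-token theory.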

Steps to reproduce

  1. ACA with managed identity authenticated image pull against ACR
  2. Wait 24 hours after initial start
  3. Container restart fails to pull image

Expected behavior: Managed-identity-based image pull should work on replica restart regardless of how much time has passed since the initial start.

Actual behavior: Managed-identity-based image pull on replica restart fails 24 hours after the initial start.

gianrubio commented 10 months ago

Unfortunately I'm seeing the same issue. One of our APIs went down for the same reason; restarting the container fixes the issue. Is there anyone from Azure who can take a look at this?

bqstony commented 10 months ago

This could be the same issue in my case!

After a health-check ProbeFailure, the container could not be restarted because of an ImagePullFailure.

My Terraform:

// Identity to handle Environment access to other resources
resource "azurerm_user_assigned_identity" "containerapps_environment_identity" {
  name                  = "${data.azurerm_container_app_environment.containerapps_environment.name}-identity"
  location              = var.location
  resource_group_name   = azurerm_resource_group.dataprocessing_rg.name
  tags                  = var.tags_default
}

// Allow to pull images from the shared Container Registry
resource "azurerm_role_assignment" "containerapps_environment_identity_registry_rbacs" {
  scope                 = data.azurerm_container_registry.sharedprodacr.id
  // https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles#acrpull
  role_definition_name  = "AcrPull"
  principal_id          = azurerm_user_assigned_identity.containerapps_environment_identity.principal_id

  depends_on = [ azurerm_user_assigned_identity.containerapps_environment_identity ]
}

resource "azapi_resource" "containerapps_environment_app_my-svc-prod" {
  type      = "Microsoft.App/containerApps@2023-05-01"
  name      = "my-svc-prod"
  parent_id = azurerm_resource_group.dataprocessing_rg.id
  location  = var.location
  tags      = var.tags_default

  body = jsonencode({
    properties = {
      environmentId = data.azurerm_container_app_environment.containerapps_environment.id
      workloadProfileName = "Consumption"
      configuration = {
        // In Single mode, a single revision is in operation at any given time.
        activeRevisionsMode = "Single"
        ingress = {
          external    = true
          // Container Target Port
          targetPort  = 80
          // transport "http2" on its own did not work, so "auto" is used
          transport   = "auto"

          stickySessions = {
            affinity = "none" 
          }

          // custom domain bindings for Container Apps' hostnames.
          customDomains = [
            {
              bindingType = "SniEnabled"
              certificateId = data.azurerm_container_app_environment_certificate.containerapps_environment_certificate.id
              name = "my-svc-prod${var.environment_short == "prod" ? "" : "-${var.environment_short}"}.mydomain.com"
            }
          ]
        }
        registries = [
          {
            server = data.azurerm_container_registry.sharedprodacr.login_server
            // Resource ID of the user-assigned managed identity to use when pulling from the Container Registry. It must also be added to the identity_ids list.
            identity = azurerm_user_assigned_identity.containerapps_environment_identity.id
          }
        ]
        secrets = [
          // ...
        ]
      }
      template = {
        containers = [
          {
            name  = "my-svc-prod"
            image = "sharedprodacr.azurecr.io/my-svc-prod:${var.environment_short}"
            resources = {
              cpu    = 1
              memory = "2Gi"
            }
            probes = [
              {
                type = "Liveness"
                httpGet = {
                  path = "/health"
                  port = 80
                  scheme = "HTTP"
                }
                periodSeconds = 10
              }
            ]
            env = [
              // ...
            ]
          }
        ]
        scale = {
          // Always 1 instance
          minReplicas = 1,
          maxReplicas = 1
        }
      }
    }
  })

  identity {
    type = "SystemAssigned, UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.containerapps_environment_identity.id
    ]
  }

  depends_on = [
    //...
    azurerm_user_assigned_identity.containerapps_environment_identity
  ]
}

After 15 retries it stopped. The next day I restarted the revision manually and it worked again.

All rows below share these constant columns: SourceSystem=RestAPI, EventSource_s=ContainerAppController, RevisionName_s=my-svc-prod--eu2u6is, EnvironmentName_s=blackbay-65a2960e, ContainerAppName_s=my-svc-prod, Level=info, Type=ContainerAppSystemLogs_CL; duplicate or empty columns (TimeGenerated, time_t, _timestamp_d, Computer, RawData, Error_s) are omitted for readability, since TimeStamp_s carries the same times to the second. <image> stands for 'sharedacr.azurecr.io/my-svc-prod-api:cb06b467b2019eb46bf1cb2677c164fe6150c99c', and <pull-failure> for "Container 'my-svc-prod' was terminated with exit code '' and reason 'ImagePullFailure'". The 2023-11-21 rows belong to the failing replica my-svc-prod--eu2u6is-846bcdff67-fvf74; the 2023-11-22 rows belong to the replacement replica my-svc-prod--eu2u6is-6c7ff49cb4-8f2d8 created by the manual restart.

TimeStamp_s (UTC)    | Reason_s            | Type_s  | Count_d | Log_s
2023-11-22 09:16:31  | ContainerCreated    | Normal  | 1       | Created container 'my-svc-prod'
2023-11-22 09:16:31  | ContainerStarted    | Normal  | 1       | Started container 'my-svc-prod'
2023-11-22 09:16:31  | PulledImage         | Normal  | 1       | Successfully pulled image <image> in 9.8839824s
2023-11-22 09:16:20  | AssigningReplica    | Normal  | 0       | Replica has been scheduled to run on a node.
2023-11-21 22:55:30  | PullingImage        | Normal  | 15      | Pulling image <image>
2023-11-21 22:55:30  | ContainerTerminated | Warning | 15      | <pull-failure>
2023-11-21 22:50:30  | PullingImage        | Normal  | 15      | Pulling image <image>
2023-11-21 22:50:30  | ContainerTerminated | Warning | 15      | <pull-failure>
2023-11-21 22:45:30  | PullingImage        | Normal  | 14      | Pulling image <image>
2023-11-21 22:45:30  | ContainerTerminated | Warning | 14      | <pull-failure>
2023-11-21 22:40:29  | PullingImage        | Normal  | 13      | Pulling image <image>
2023-11-21 22:40:30  | ContainerTerminated | Warning | 13      | <pull-failure>
2023-11-21 22:35:29  | PullingImage        | Normal  | 12      | Pulling image <image>
2023-11-21 22:35:29  | ContainerTerminated | Warning | 12      | <pull-failure>
2023-11-21 22:30:29  | PullingImage        | Normal  | 11      | Pulling image <image>
2023-11-21 22:30:29  | ContainerTerminated | Warning | 11      | <pull-failure>
2023-11-21 22:25:29  | PullingImage        | Normal  | 10      | Pulling image <image>
2023-11-21 22:25:29  | ContainerTerminated | Warning | 10      | <pull-failure>
2023-11-21 22:20:29  | PullingImage        | Normal  | 9       | Pulling image <image>
2023-11-21 22:20:29  | ContainerTerminated | Warning | 9       | <pull-failure>
2023-11-21 22:15:29  | PullingImage        | Normal  | 8       | Pulling image <image>
2023-11-21 22:15:29  | ContainerTerminated | Warning | 8       | <pull-failure>
2023-11-21 22:10:28  | PullingImage        | Normal  | 7       | Pulling image <image>
2023-11-21 22:10:29  | ContainerTerminated | Warning | 7       | <pull-failure>
2023-11-21 22:05:28  | PullingImage        | Normal  | 6       | Pulling image <image>
2023-11-21 22:05:28  | ContainerTerminated | Warning | 6       | <pull-failure>
2023-11-21 22:00:28  | PullingImage        | Normal  | 5       | Pulling image <image>
2023-11-21 22:00:28  | ContainerTerminated | Warning | 5       | <pull-failure>
2023-11-21 21:57:48  | PullingImage        | Normal  | 4       | Pulling image <image>
2023-11-21 21:57:48  | ContainerTerminated | Warning | 4       | <pull-failure>
2023-11-21 21:56:28  | PullingImage        | Normal  | 3       | Pulling image <image>
2023-11-21 21:56:28  | ContainerTerminated | Warning | 3       | <pull-failure>
2023-11-21 21:55:48  | PullingImage        | Normal  | 2       | Pulling image <image>
2023-11-21 21:55:48  | ContainerTerminated | Warning | 2       | <pull-failure>
2023-11-21 21:55:27  | PullingImage        | Normal  | 1       | Pulling image <image>
2023-11-21 21:55:28  | ContainerTerminated | Warning | 1       | <pull-failure>
2023-11-21 21:55:15  | ContainerTerminated | Warning | 1       | Container 'my-svc-prod' was terminated with exit code '' and reason 'ProbeFailure'
maskati commented 10 months ago

I might also be wrong about the time to failure; it could in fact be 1 hour instead of 24 hours. I don't have the patience to wait long periods when testing things.

chinadragon0515 commented 9 months ago

We identified an issue where, when a customer app runs on the Consumption workload profile with a managed identity, a container that exits will not come up again. We have fixed the issue, and the fix has been deployed to all regions.

Let us know if you still see issues.

maskati commented 9 months ago

@chinadragon0515 seems to be working now, thanks!

jellehellmann commented 8 months ago

@chinadragon0515 since last week we're experiencing this issue again in West Europe. Any chance it was somehow re-introduced?

dtcos commented 8 months ago

Hi, we have been experiencing this issue as of last week; it started on 13 Jan at 8:30am UTC. My setup is a container app environment and a container app running in West Europe (workload profiles).

klemmchr commented 8 months ago

We have been having the same issue for two days. This is urgent: container apps are randomly failing to pull images in production.

@chinadragon0515 could you have a look at this?

davidkarlsen commented 8 months ago

WTF!?

maskati commented 8 months ago

Experiencing the same ATM.

nimro commented 8 months ago

Seeing this in UK South too. You can see our container was OOM-killed at 03:59:23, followed by an hour of PullingImage/ImagePullFailure messages (I only included the first few).


ericxl commented 8 months ago

Please fix this. It is causing major downtime for our servers in US West 2. Is there any workaround?

klemmchr commented 8 months ago

> Please fix this. It is causing major downtime for our servers in US West 2. Is there any workaround?

The only known workaround is to not use managed identity but tokens.
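
For anyone evaluating that workaround: ACR's standard Docker token endpoint can be exercised directly with the registry's admin username/password, independent of the MSI path. A rough sketch using the requests Python package; the registry, repository, and credentials are placeholders:

import requests

REGISTRY = "myregistry.azurecr.io"  # placeholder registry
REPO = "my-app"                     # placeholder repository
USER = "myregistry"                 # ACR admin username (placeholder)
PASSWORD = "<admin-password>"       # ACR admin password (placeholder)

# Request a pull-scoped access token the way a docker client would.
resp = requests.get(
    f"https://{REGISTRY}/oauth2/token",
    params={"service": REGISTRY, "scope": f"repository:{REPO}:pull"},
    auth=(USER, PASSWORD),
    timeout=30,
)
resp.raise_for_status()
token = resp.json()["access_token"]

# A manifest HEAD request confirms the token can actually pull.
manifest = requests.head(
    f"https://{REGISTRY}/v2/{REPO}/manifests/latest",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.docker.distribution.manifest.v2+json",
    },
    timeout=30,
)
print(manifest.status_code)  # 200 means the admin credentials can pull

If this succeeds while MSI-based pulls fail, it points at the token-renewal path rather than at the registry or the image.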

ericxl commented 8 months ago

> Please fix this. It is causing major downtime for our servers in US West 2. Is there any workaround?
>
> The only known workaround is to not use managed identity but tokens.

Thanks, but we are using admin roles, not managed identities.

maskati commented 8 months ago

I have only experienced this issue with managed-identity-authenticated image pull, which is also the topic of this bug report. If you are experiencing issues in other contexts, it would probably be advisable to create a separate issue.

bqstony commented 8 months ago

It is also not working in CHN.

chinadragon0515 commented 8 months ago

@bqstony @maskati @ericxl @klemmchr @nimro @dtcos @jellehellmann @davidkarlsen We are not aware of any known issue now. Can you please send us an email at acasupport at microsoft dot com with your container app and environment info so we can follow up with you?

davidkarlsen commented 8 months ago

@chinadragon0515 Will it be debugged properly, or is it the usual Mindtree Ltd support (in which case I cannot be bothered)? Sent an email just now - hopefully we can get to the bottom of this...

chinadragon0515 commented 8 months ago

All, we investigated the issue and have identified the root cause. We made a long-term fix, but part of it has not been deployed yet, which caused the regression. We are working to revert to the short-term fix and expect to deploy it to all regions in the next two days.

In the meantime, we have already set up an auto-detect and mitigation workflow; all impacted container apps should have been auto-mitigated.

Let me know if you still see the issue. Sorry for the inconvenience. Thanks.

maskati commented 8 months ago

@chinadragon0515 at least the mitigation has not fixed failed replicas (1 of 3 replicas is still down due to the issue). I will perform a restart, which generally fixes the issue for some time, usually 24 hours. I will report back if the issue reoccurs on fresh replicas.

chinadragon0515 commented 8 months ago

@maskati can you please send us an email at acasupport at microsoft dot com with your container app and environment info? I want to check why it was not mitigated and whether it is the same issue.

Note that the issue I mentioned occurs when MSI is used to pull the image and the replica of the container app is somehow terminated (for example, OOM-killed); the replica can then get stuck in a bad state.

If you do not use MSI, it is a different issue, and you can send your container app and environment info to us to investigate further.

Chris-Sheridan commented 7 months ago

I am having the same issue. I had a container app job that worked a few days ago, but now it's unable to pull the image from my private ACR, with the reason ImagePullFailure. I deleted the job and created a fresh one but received the same error. I have the job connecting to the private ACR with admin credentials. Any thoughts?

Chris-Sheridan commented 7 months ago

Update on my end: I created a managed identity, gave it ACR push/pull rights, and had the job use that identity to pull the container. It then worked. When I checked the job again it had automatically switched back to admin credentials, but it's working now. Really odd.

jehell25 commented 7 months ago

@chinadragon0515 thanks for the fix. On our systems the fix has mitigated the total crash of the replica after 15 image-pull errors.

We are still seeing some image-pull errors in the logs on some environments, but with a datetime of 2024-01-25T06:42:14.4108994Z and a failure Count_d of up to 6 or 7. We use this KQL to query for the image-pull issues:

ContainerAppSystemLogs_CL
| where Log_s contains "was terminated with exit code '' and reason 'ImagePullFailure'" and Count_d > 0

We had one container app running multiple replicas (3) where more pull errors kept happening, up until today. We recreated it and now it seems to work.

mateoscarlos commented 7 months ago

I'm still having the same issue.

I get 1/1 Pending:ImagePullBackOff on legion when I try to pull the image from a private registry (the registry is in another resource group, by the way, but I send the credentials). The same happens when I run az containerapp up with the --source argument instead of image.

Do we have any news? @chinadragon0515

2024-02-08: Issue solved.

StGunneR commented 6 months ago

Hello, the issue still exists for me... UK South region.

JonasSamuelsson commented 6 months ago

We are also seeing image pull failures in apps using managed identity, across multiple container app environments, all in West Europe.


technight commented 6 months ago

Also seeing the "Pending:ImagePullBackOff on legion" error in West Europe across multiple container apps right now. We are using a user-assigned managed identity for ACR pull rights and the Consumption workload profile in the app environment. It fails immediately after deployment.

jehell25 commented 6 months ago

Same for us in West Europe. Multiple failures after deployment.

nimro commented 6 months ago

We've also been seeing the post-deployment image pull failures that @JonasSamuelsson, @technight, and @jehell25 mentioned. It's not the same behaviour as the lead post for this issue: it happens immediately during deployment of a new revision with a new image tag, rather than after ~24 hours as before.

Using ACR with managed identity auth. Deployments via Azure Pipelines task.

Restarting the failed revision manually does allow it to successfully start up and pull the image, so the issue appears to be isolated to the initial deployment.

vinisoto commented 6 months ago

Hi - we have identified a race condition as the root cause of the issue and are in the process of producing and rolling out a fix.

Details:

The root cause of the ImagePullBackoff errors users are experiencing is a race condition within the platform, specifically impacting apps running in the Consumption workload profile. The condition occurs when the system incorrectly updates the in-memory token intended for image pulls to an empty value for some replicas during replica creation.
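
To illustrate the failure mode (a hypothetical sketch, not the actual platform code): if the cached token is blanked before its replacement is fetched, any replica created inside that window pulls with an empty credential and fails.

import threading
import time

class ImagePullTokenCache:
    """Hypothetical in-memory token cache of the kind described above."""

    def __init__(self):
        self._token = "valid-token"

    def refresh_buggy(self):
        # BUG: the cached value is cleared before the new one is fetched,
        # so concurrent readers can observe an empty token.
        self._token = ""
        time.sleep(0.05)  # simulated round trip to the token service
        self._token = "renewed-token"

    def get(self):
        return self._token

cache = ImagePullTokenCache()
observed = []

def create_replica():
    # Replica creation reads whatever token is cached at that moment.
    observed.append(cache.get())

refresher = threading.Thread(target=cache.refresh_buggy)
refresher.start()
time.sleep(0.01)  # replica creation lands mid-refresh
create_replica()
refresher.join()

print(observed)  # [''] -> this replica pulls with an empty token and fails

The fix is presumably to build the new token first and swap it in atomically, so readers see either the old value or the new one, never an empty string.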

Will update this issue once the rollout with the fix is completed.

nbwdk commented 6 months ago

We are also experiencing this with Consumption-based container app jobs using admin credentials in North Europe. If we run a single execution it works, but when running multiple executions, the majority fail to pull the image. We see the same when running a single execution with parallelism.

Any update on the fix?

donran commented 4 months ago

Hello, we are waiting for this as well. Any update on the ETA?

Jordan-Murray commented 2 months ago

We're also experiencing this issue; it only arose after a .NET 8 upgrade of the App Service/Dockerfile. Any more information/ETA? @vinisoto

BossensM commented 2 months ago

We are experiencing the same issue.

klemmchr commented 2 months ago

We are experiencing this issue from time to time with Container App Jobs. We have a scheduled execution that runs once per hour. Most runs are fine, but on some days single or even multiple runs fail because they cannot pull the image.

tobiasholzner-whiteduck commented 3 weeks ago

We are also experiencing this issue on one of our container app environments. @vinisoto, when can we expect the problem to be solved?