microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

`az containerapp create` causes downtime in single revision mode #1305

Closed chinwobble closed 2 weeks ago

chinwobble commented 1 month ago

Issue description

I am deploying an Azure Container App in single revision mode. I define how I want the app to work, including health checks, in YAML, and I use yq to change the image tag when I want to release a new version. When I deploy the new version using az containerapp create, I get 503s for a minute (I think while the new pod is being made ready).

Steps to reproduce

  1. Create a yaml file
    identity:
      type: UserAssigned
      userAssignedIdentities:
        ? /subscriptions/{subscriptionId}/resourcegroups/p-rg-platform-shared/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-acrpull
        : {}
    location: Australia East
    properties:
      configuration:
        activeRevisionsMode: Single
        ingress:
          allowInsecure: false
          clientCertificateMode: null
          corsPolicy: null
          exposedPort: 0
          external: true
          ipSecurityRestrictions: null
          stickySessions: null
          targetPort: 8080
          traffic:
            - latestRevision: true
              weight: 100
          transport: Auto
        maxInactiveRevisions: null
        service: null
        registries:
          - identity: '/subscriptions/{subscriptionId}/resourcegroups/p-rg-platform-shared/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-acrpull'
            server: acrcsd.azurecr.io
      template:
        containers:
          - image: mycompany.azurecr.io/tools-app:1
            name: app
            resources:
              cpu: 0.25
              ephemeralStorage: 1Gi
              memory: 0.5Gi
            env:
              - name: DOTNET_ENVIRONMENT
                value: Staging
        initContainers: null
        revisionSuffix: ''
        scale:
          maxReplicas: 1
          minReplicas: 1
          rules: null
        serviceBinds: null
        terminationGracePeriodSeconds: null
        volumes: null
    tags:
      env: staging
  2. Run the following command to deploy the yaml file
    az containerapp create \
      --name "$appname" \
      --resource-group "$RESOURCE_GROUP" \
      --environment "$managedEnvId" \
      --subscription "$SUBSCRIPTION_ID" \
      --yaml "$transformed_yaml"
  3. Deploy a new version of your code with a new image tag.
    docker push mycompany.azurecr.io/tools-app:2
  4. Update the yaml file above with the image tag. We use sed in a bash script to read the value from a pipeline_run_id.
  5. Observe that there is downtime.
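
Step 4 could look roughly like the following. This is a minimal sketch, not the script from the pipeline: the function name `set_image_tag` and the file arguments are hypothetical, and it assumes the image reference has the form `<repository>:<tag>` with no registry port or digest.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the tag on every "image:" line of a YAML file.
# Assumes image references look like <repository>:<tag> (no registry port, no digest)
# and that the new tag contains no "|", "&" or backslash characters.
set_image_tag() {
  local yaml_file="$1"
  local new_tag="$2"
  local out_file="$3"
  # Group 1 keeps everything up to the last colon of the image reference;
  # the old tag after that colon is replaced with the new one.
  sed -E "s|^([[:space:]]*-?[[:space:]]*image:[[:space:]]*[^[:space:]]+):[^:[:space:]]+[[:space:]]*$|\1:${new_tag}|" \
    "$yaml_file" > "$out_file"
}
```

For example, `set_image_tag app.yaml "$pipeline_run_id" "$transformed_yaml"` (where `app.yaml` is a placeholder for the file from step 1) would write the updated definition to the path that is then passed to `--yaml`.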

Expected behavior

According to this page, deployments should have zero downtime: https://learn.microsoft.com/en-us/azure/container-apps/revisions#zero-downtime-deployment

Actual behavior

We get 503 errors for a few seconds after az containerapp create has been run. I think this is the error message from Envoy, which is used internally by ACA:

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: 111

chinwobble commented 1 month ago

I've tested the app locally and inside Docker. The app is exposed on port 8080, and when I navigate to localhost:8080/health I get a 200 response.

When deployed onto Azure Container Apps I keep getting these error logs:

Probe with executor HttpGetExecutor reached failure threshold 3, changing status to Failure.

I have tried changing the probe scheme to HTTP and HTTPS, and it's not making any difference.

      probes:
      - type: liveness
        initialDelaySeconds: 10
        httpGet:
          path: "/health"
          scheme: "HTTP"
          port: 8080
      - type: readiness
        initialDelaySeconds: 10
        httpGet:
          path: "/health"
          scheme: "HTTP"
          port: 8080

My app is a simple ASP.NET Core Razor Pages app. The container logs show

[06:54:00 INF] Now listening on: http://[::]:8080

My dockerfile has the standard template.

FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS base
USER $APP_UID
WORKDIR /app
EXPOSE 8080
EXPOSE 8081 

Though it's my understanding that EXPOSE doesn't really do anything.

v-vish commented 1 month ago

@chinwobble I tried to reproduce the issue by updating a container app using Nginx in the East US region, but the update completed with zero downtime. To monitor the service status during the update, I used the watch and wget tools.

Could you try deploying a new container using the same setup but with a different image, such as Nginx, and provide an update on the results?

As for the second issue, could you set the probes' periodSeconds value to something above 30 seconds and test again? Please let me know the outcome after making this change.
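
Polling the app while an update rolls out, as described above, could be scripted roughly as follows. This is a sketch rather than the exact commands used: it relies on curl instead of watch and wget, and the names `probe_once` and `monitor` as well as the URL in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the HTTP status code of one request.
# curl prints "000" when no HTTP response was received (e.g. connection refused).
probe_once() {
  curl --silent --output /dev/null --max-time 5 --write-out '%{http_code}' "$1"
}

# Poll a URL a fixed number of times, log every non-200 response to stderr,
# and print the total number of failed requests to stdout.
monitor() {
  local url="$1"
  local attempts="$2"
  local interval="${3:-1}"
  local failures=0
  local status
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # "|| true" keeps the loop going when curl itself exits non-zero.
    status="$(probe_once "$url")" || true
    if [ "$status" != "200" ]; then
      failures=$((failures + 1))
      echo "$(date -u +%H:%M:%S) request $i returned $status" >&2
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "$failures"
}
```

Running something like `monitor "https://<app-fqdn>/health" 120 1` in one terminal while the update is applied in another would show when, and for how long, requests stop returning 200.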

chinwobble commented 1 month ago

Thanks for looking into the issue for me. Was your container app in a custom VNet? I have set up the health probes, and I can see the infra is making the health probe requests, but they are failing.

I will create a brand new app env and see what happens.

v-vish commented 1 month ago

@chinwobble Yes, please create a brand new app env and let us know the status.

microsoft-github-policy-service[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.