`az containerapp create` causes downtime in single revision mode

chinwobble commented 1 month ago

This issue is a: (mark with an x)

[x] bug report -> please search issues before submitting
[ ] documentation issue or request
[ ] regression (a behavior that used to work and stopped in a new release)

Issue description

I am deploying an Azure Container App in single revision mode. I define how I want the app to work, including health checks, in YAML, and I use yq to change the image tag when I want to release a new version. When I deploy the new version using `az containerapp create`, I get 503s for a minute (I think while the new pod is being made ready).

Steps to reproduce

Create a YAML file:

identity:
  type: UserAssigned
  userAssignedIdentities:
    ? /subscriptions/{subscriptionId}/resourcegroups/p-rg-platform-shared/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-acrpull
    : {}
location: Australia East
properties:
  configuration:
    activeRevisionsMode: Single
    ingress:
      allowInsecure: false
      clientCertificateMode: null
      corsPolicy: null
      exposedPort: 0
      external: true
      ipSecurityRestrictions: null
      stickySessions: null
      targetPort: 8080
      traffic:
        - latestRevision: true
          weight: 100
      transport: Auto
    maxInactiveRevisions: null
    service: null
    registries:
      - identity: '/subscriptions/{subscriptionId}/resourcegroups/p-rg-platform-shared/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-acrpull'
        server: acrcsd.azurecr.io
  template:
    containers:
      - image: mycompany.azurecr.io/tools-app:1
        name: app
        resources:
          cpu: 0.25
          ephemeralStorage: 1Gi
          memory: 0.5Gi
        env:
          - name: DOTNET_ENVIRONMENT
            value: Staging
    initContainers: null
    revisionSuffix: ''
    scale:
      maxReplicas: 1
      minReplicas: 1
      rules: null
    serviceBinds: null
    terminationGracePeriodSeconds: null
    volumes: null
tags:
  env: staging

Run the following command to deploy the yaml file

az containerapp create \
  --name "$appname" \
  --resource-group "$RESOURCE_GROUP" \
  --environment "$managedEnvId" \
  --subscription "$SUBSCRIPTION_ID" \
  --yaml "$transformed_yaml"

Deploy a new version of your code with a new image tag.
```
docker push mycompany.azurecr.io/tools-app:2
```
Update the YAML file above with the new image tag (we use sed in a bash script to set it from a pipeline_run_id), then run `az containerapp create` again.
There is downtime: requests return 503 while the new version starts.
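
As a rough illustration of the tag-update step above, the sed substitution could be wrapped in a small bash function like the sketch below. The original script is not shown in this issue, so the function name `set_image_tag` and the file name `app.yaml` are hypothetical, and the pattern assumes the tag contains no `|` or `&` characters (true for a numeric run ID).

```shell
# Hypothetical helper: print the YAML file with the tools-app image tag replaced.
# Usage: set_image_tag <yaml-file> <new-tag>
set_image_tag() {
  local file="$1" tag="$2"
  # Keep everything up to and including "tools-app:" and swap in the new tag.
  sed -E "s|(image: mycompany\.azurecr\.io/tools-app:).*|\1${tag}|" "$file"
}
```

It could then be called as `set_image_tag app.yaml "$pipeline_run_id" > "$transformed_yaml"` before running the deploy command. The function writes to stdout rather than editing in place, so the source YAML stays unchanged between releases.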

Expected behavior

According to this page, there should be zero downtime: https://learn.microsoft.com/en-us/azure/container-apps/revisions#zero-downtime-deployment

Actual behavior

We get 503 errors for a few seconds after `az containerapp create` has run. I think this error message comes from Envoy, which ACA uses internally:

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: 111

Additional context

My app environment is in a custom VNet.
I am using the Azure CLI.

chinwobble commented 1 month ago

I've tested the app locally and inside Docker. I have port 8080 exposed, and when I navigate to localhost:8080/health I get a 200 response.

When deployed onto Azure Container Apps I keep getting these error logs:

Probe with executor HttpGetExecutor reached failure threshold 3, changing status to Failure.

I have tried changing the probe scheme to HTTP and HTTPS, and it's not making any difference.

      probes:
      - type: liveness
        initialDelaySeconds: 10
        httpGet:
          path: "/health"
          scheme: "HTTP"
          port: 8080
      - type: readiness
        initialDelaySeconds: 10
        httpGet:
          path: "/health"
          scheme: "HTTP"
          port: 8080

My app is a simple ASP.NET Core Razor Pages app. The container logs show:

[06:54:00 INF] Now listening on: http://[::]:8080

My Dockerfile uses the standard template:

FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS base
USER $APP_UID
WORKDIR /app
EXPOSE 8080
EXPOSE 8081

Though it's my understanding that EXPOSE doesn't really do anything.

v-vish commented 1 month ago

@chinwobble I tried to reproduce the issue by updating a container app using Nginx in the East US region, but the update completed with zero downtime. To monitor the service status during the update, I used the watch and wget tools.
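
The monitoring described above can be approximated with a polling loop. This is only a sketch, not the exact commands used in that test: it uses curl rather than watch and wget, and the names `poll_status` and `count_failures` as well as the URL in the usage note are hypothetical placeholders.

```shell
# Poll a URL once per second and print a timestamp plus the HTTP status code.
# Usage: poll_status <url> <number-of-requests>
poll_status() {
  local url="$1" count="$2" code
  for _ in $(seq "$count"); do
    # curl reports 000 when no HTTP response was received at all.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url" || true)
    printf '%s %s\n' "$(date -u +%H:%M:%S)" "$code"
    sleep 1
  done
}

# Read "<timestamp> <code>" lines on stdin and print how many are not 2xx.
count_failures() {
  awk '$2 !~ /^2[0-9][0-9]$/ { n++ } END { print n + 0 }'
}
```

For example, run `poll_status "https://<app-fqdn>/health" 120 > poll.log` in one terminal while deploying in another, then `count_failures < poll.log`. A non-zero count during the revision switch would show the 503 window and roughly how long it lasted.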

Could you try deploying a new container using the same setup but with a different image, such as Nginx, and provide an update on the results?

As for the second issue, could you set the probes' `periodSeconds` value to something above 30 seconds and test again? Please let me know the outcome after making this change.

chinwobble commented 1 month ago

Thanks for looking into the issue for me. Was your container app in a custom VNet? I have set up the health probes, and I can see the infrastructure is attempting them, but they are failing.

I will create a brand new app env and see what happens.

v-vish commented 1 month ago

@chinwobble Yes, please create a brand new app env and let us know the status.

microsoft-github-policy-service[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

microsoft / azure-container-apps