Container revisions will not start

AndyThurgood commented 2 weeks ago


This issue is a: (mark with an x)

[ x ] bug report -> please search issues before submitting
[ ] documentation issue or request
[ x ] regression (a behavior that used to work and stopped in a new release)

Issue description

We appear to be seeing the same problem as the one raised in this issue last week.

We have 4 Container Apps environments, and since this morning none of our apps (20+ per environment) are able to start a new revision. This started in our test environments, because those container apps scale to zero; when traffic hit those environments, none of the apps were able to start.

It is now impacting our production environment, as those apps scale down overnight.

The limited logs generated on a revision are shown below. Every 2 minutes the environment appears to attempt to assign a replica, until the startup process fails.


{"TimeStamp":"2024-07-10 20:15:20 \u002B0000 UTC","Type":"Normal","ContainerAppName":"#########","RevisionName":"#########--r42bar6","ReplicaName":"#########--r42bar6-59b587c955-xph42","Msg":"Replica \u0027#########--r42bar6-59b587c955-xph42\u0027 has been scheduled to run on a node.","Reason":"AssigningReplica","EventSource":"ContainerAppController","Count":0}
{"TimeStamp":"2024-07-10 20:17:21 \u002B0000 UTC","Type":"Normal","ContainerAppName":"#########","RevisionName":"#########--r42bar6","ReplicaName":"#########--r42bar6-59b587c955-h6rhf","Msg":"Replica \u0027#########--r42bar6-59b587c955-h6rhf\u0027 has been scheduled to run on a node.","Reason":"AssigningReplica","EventSource":"ContainerAppController","Count":0}
{"TimeStamp":"2024-07-10 20:19:22 \u002B0000 UTC","Type":"Normal","ContainerAppName":"#########","RevisionName":"#########--r42bar6","ReplicaName":"#########--r42bar6-59b587c955-qz9wn","Msg":"Replica \u0027#########--r42bar6-59b587c955-qz9wn\u0027 has been scheduled to run on a node.","Reason":"AssigningReplica","EventSource":"ContainerAppController","Count":0}
{"TimeStamp":"2024-07-10 20:21:23 \u002B0000 UTC","Type":"Normal","ContainerAppName":"#########","RevisionName":"#########--r42bar6","ReplicaName":"#########--r42bar6-59b587c955-h2jnt","Msg":"Replica \u0027#########--r42bar6-59b587c955-h2jnt\u0027 has been scheduled to run on a node.","Reason":"AssigningReplica","EventSource":"ContainerAppController","Count":0}
{"TimeStamp":"2024-07-10 20:23:25 \u002B0000 UTC","Type":"Normal","ContainerAppName":"#########","RevisionName":"#########--r42bar6","ReplicaName":"#########--r42bar6-59b587c955-lvjkl","Msg":"Replica \u0027#########--r42bar6-59b587c955-lvjkl\u0027 has been scheduled to run on a node.","Reason":"AssigningReplica","EventSource":"ContainerAppController","Count":0}
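The roughly two-minute cadence can be confirmed from the timestamps themselves. The following is a minimal Python sketch, assuming the system logs have been exported as one JSON object per line in the format shown above. The function names and the trimmed-down sample lines are illustrative only and are not part of any Azure tooling.

```python
import json
from datetime import datetime


def parse_timestamp(raw):
    # Timestamps look like "2024-07-10 20:15:20 +0000 UTC" once JSON has
    # decoded the \u002B escape; the trailing "UTC" is matched literally.
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z UTC")


def assignment_intervals(lines):
    """Return the seconds between consecutive AssigningReplica events."""
    stamps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Reason") == "AssigningReplica":
            stamps.append(parse_timestamp(event["TimeStamp"]))
    stamps.sort()
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# Trimmed copies of the first three log lines above (other fields omitted).
sample = [
    r'{"TimeStamp":"2024-07-10 20:15:20 \u002B0000 UTC","Reason":"AssigningReplica"}',
    r'{"TimeStamp":"2024-07-10 20:17:21 \u002B0000 UTC","Reason":"AssigningReplica"}',
    r'{"TimeStamp":"2024-07-10 20:19:22 \u002B0000 UTC","Reason":"AssigningReplica"}',
]

print(assignment_intervals(sample))  # [121.0, 121.0]
```

Each retry also carries a different replica name, so a steady interval with changing names suggests a fresh replica is being scheduled on every attempt rather than one replica restarting.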

If we tweak the scale rules to force at least one active revision, we see a new revision spawn, but the new revision will not start, and it appears that no replica ever gets assigned to the revision. The previous revision is never removed, as per below:

It's worth noting that we haven't made any changes to our images for these services, and we haven't seen any issues in these environments in the last 6 months.

We have also tried destroying a test environment and recreating it, which hasn't resolved the issue.

Steps to reproduce

N/A

Expected behavior

We expect that a container instance/revision should start

Actual behavior

No container apps are able to start

Screenshots

Additional context

Azure Environment: UK South
Workload Profile: Consumption (4 vCPU / 8 GB memory)

simonjj commented 2 weeks ago

Thanks for raising this. Can you please send the blocked out content to acasupport at microsoft.com. We would need your subscription, app and environment name please.

AndyThurgood commented 2 weeks ago

Hi @simonjj I've sent that detail across. Thanks

dyamo commented 2 weeks ago

I'm having the same issue, also in UK South.

rajravat commented 2 weeks ago

I am also seeing the same problem on multiple container apps in the uksouth region

simonjj commented 2 weeks ago

Just to update the thread. We're investigating this issue and will update once it's resolved.

MathieuDiepman commented 2 weeks ago

same issue here in West Europe

MathieuDiepman commented 2 weeks ago

In our case, reducing the maximum number of container instances resolves the problem. With 3 instances we would end up with 6 cores, which is more than the allowed 4. With a maximum of 2 container instances, both instances start without any problems.
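The arithmetic behind that observation can be written out as a short check. This is only a sketch: the 2 cores per instance is inferred from "3 instances would end up with 6 cores", the limit of 4 is the figure quoted above, and the function names are made up for illustration.

```python
def total_cores(replicas, cores_per_replica):
    """Cores requested when every replica is running."""
    return replicas * cores_per_replica


def max_replicas_within_limit(cores_per_replica, core_limit):
    """Largest replica count whose total cores stay within the limit."""
    return int(core_limit // cores_per_replica)


CORES_PER_REPLICA = 2  # assumption inferred from the comment above
CORE_LIMIT = 4         # "the allowed 4" from the comment above

print(total_cores(3, CORES_PER_REPLICA))                         # 6
print(total_cores(3, CORES_PER_REPLICA) > CORE_LIMIT)            # True
print(max_replicas_within_limit(CORES_PER_REPLICA, CORE_LIMIT))  # 2
```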

MathieuDiepman commented 2 weeks ago

Scratch that thought, 3 instances are running now. Looks like something got fixed in Azure.

jsheetzmt commented 2 weeks ago

Any update on this? This is a major problem..

rauschp commented 2 weeks ago

I'm seeing something very similar.. specifically errors such as this:

{"TimeStamp":"2024-07-12 14:46:40 \u002B0000 UTC","Type":"Normal","ContainerAppName":"{CONTAINERAPPNAME}","RevisionName":"{REVISIONNAME}","ReplicaName":"{REPLICANAME}","Msg":"Replica {REPLICANAME} has been scheduled to run on a node.","Reason":"AssigningReplica","EventSource":"ContainerAppController","Count":0}
{"TimeStamp":"2024-07-12 14:47:23 \u002B0000 UTC","Type":"Warning","ContainerAppName":"{CONTAINERNAME}","RevisionName":"{REVISIONNAME}","ReplicaName":"","Msg":"ScaledObject doesn\u0027t have correct triggers specification","Reason":"ScaledObjectCheckFailed","EventSource":"KEDA","Count":9}
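When the system log is mostly routine "Normal" events, grouping entries by type, source and reason makes a warning like the KEDA one easier to spot. A minimal Python sketch, assuming one JSON object per line as in the entries above; `summarize_events` and `warnings_only` are hypothetical helper names, and the sample lines are trimmed copies of the two entries above.

```python
import json
from collections import Counter


def summarize_events(lines):
    """Count log entries per (Type, EventSource, Reason)."""
    summary = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        key = (event.get("Type"), event.get("EventSource"), event.get("Reason"))
        summary[key] += 1
    return summary


def warnings_only(summary):
    """Keep only the groups whose Type is "Warning"."""
    return {key: count for key, count in summary.items() if key[0] == "Warning"}


# Trimmed copies of the two log lines above (other fields omitted).
sample = [
    '{"Type":"Normal","Reason":"AssigningReplica","EventSource":"ContainerAppController"}',
    '{"Type":"Warning","Reason":"ScaledObjectCheckFailed","EventSource":"KEDA"}',
]

print(warnings_only(summarize_events(sample)))
# {('Warning', 'KEDA', 'ScaledObjectCheckFailed'): 1}
```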

While Terraform is used to deploy the containers, nothing has been changed with min/max replicas. For everything that has been deployed or changed since July 9th, the new revision isn't deploying and sits in a state of activating, showing 0/0 ready with -Infinity restarts.

Increasing to 1 min/2 max did not fix my issue

Edit: If it matters, our CAE and containers are on the consumption model.

AndyThurgood commented 2 weeks ago

An update from the original reporter. We have mitigated this issue for now by forcing our containers to never scale to zero, but we are fearful of those containers being force restarted by Azure.

If it helps, the only way we seemed to be able to get those services back online was to repeatedly scale the revision up and down in terms of minimum instances. Even then, it sometimes took 15 minutes or more for the revision to activate.
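For anyone scripting the "never scale to zero" mitigation, a minimal Python sketch follows. It assumes the Azure CLI with the `containerapp` commands is installed and logged in, and relies on `az containerapp update` with its `--min-replicas` option. The app and resource group names are placeholders, the helper names are hypothetical, and the default is a dry run that only returns the command. It applies the workaround described here and does not address the underlying platform issue.

```python
import subprocess


def min_replicas_command(app_name, resource_group, min_replicas):
    """Build the az CLI call that sets an app's minimum replica count."""
    return [
        "az", "containerapp", "update",
        "--name", app_name,
        "--resource-group", resource_group,
        "--min-replicas", str(min_replicas),
    ]


def keep_warm(app_name, resource_group, dry_run=True):
    """Set the minimum to 1 so the app never scales to zero."""
    command = min_replicas_command(app_name, resource_group, 1)
    if not dry_run:
        # Raises CalledProcessError if the CLI call fails.
        subprocess.run(command, check=True)
    return command


# "my-app" and "my-rg" are placeholder names, not values from this thread.
print(" ".join(keep_warm("my-app", "my-rg")))
# az containerapp update --name my-app --resource-group my-rg --min-replicas 1
```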

From our perspective, this was a really challenging issue because we had no way to diagnose what was happening other than looking at the limited logs that were (sometimes) generated by the system console of a container. Even then, we didn't get any information that pointed to what was happening.

When a container eventually failed to activate, the Azure portal had no information in any of the logs, meaning we were pretty stuck.

This seems like a capacity issue, which is (I think) why restarting the containers again and again eventually got our containers online.

rajravat commented 2 weeks ago

The only way we were able to get the container to activate was to delete the container app and redeploy. The issue did eventually reappear when the container scaled to 0 and then tried to scale back up, but it's been intermittent, and it's been working "ok" today.

simonjj commented 2 weeks ago

This issue should be resolved now. The impact should have been limited to uksouth. Please notify us here if there continue to be issues with spinning up new revisions or replicas.

jsheetzmt commented 2 weeks ago

This impacted us in Central US as well. Looks to be resolved now

davidstarkcab commented 2 weeks ago

We are facing this issue here in Western Europe.

flannoo commented 2 weeks ago

This issue is happening on our various container apps environments as well (dev, quality, production). We are also using the consumption model and are hosting this in West Europe. We opened a Microsoft support case, hopefully it gets resolved soon.

jcools85 commented 2 weeks ago

Same issue on all environments in West-Europe. Consumption workload profiles

jurepurgar commented 2 weeks ago

We have the same issue in all environments in West Europe. New replicas are not activating.

wouterbruining commented 2 weeks ago

I'm also having this issue in West-Europe, all my replicas are down and won't start. Scaling up and down like somebody suggested seems to work sometimes.

Seeing this error:

"ScaledObject doesn't have correct triggers specification","Reason":"ScaledObjectCheckFailed","EventSource":"KEDA"

czarnero commented 2 weeks ago

I also had this issue with a customer's prod deployment in West Europe yesterday. Only thing that helped was throwing the whole CA away and redeploying (which I know might not be feasible for some unfortunately).

simonjj commented 1 week ago

There might be a few more regions which exhibited this behavior; we've mostly cleaned this up across the globe. Please open a new issue if it should pop up again. Thank you all for being patient/diligent/friendly with us.

microsoft / azure-container-apps