microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

Container App Job immediately fails on event trigger start #1207

Closed sonofhammer closed 1 week ago

sonofhammer commented 1 week ago

Issue description

A container app job with an event trigger worked on Tuesday the 18th, but has been failing since Thursday the 20th.

The manual "Run Now" button in the portal still executes the container app job without issue.

Here's what happens:

The event triggers successfully on the Service Bus queue and creates a pod.

Successfully created pod for Job Execution '<job name redacted>'

followed by

Replica '<replica name redacted>' for Job Execution '<job name redacted>' has been scheduled to run on a node.

But then immediately goes into

Pod - <replica name redacted> has exited with status Failed

with a reason of PodDeletion

We're not even getting to the image pull log line. It just fails immediately on start.
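To make the pattern above easier to spot in exported system logs, here is a minimal Python sketch. It only assumes that the log lines contain the phrases quoted in this thread ("has been scheduled to run on a node", "has exited with status Failed", "Created container", "Started container"); the function name `failed_before_start` is hypothetical and not part of any Azure tooling.

```python
from typing import Iterable

# Phrases copied from the log lines quoted in this issue.
SCHEDULED = "has been scheduled to run on a node"
FAILED = "has exited with status Failed"
START_MARKERS = ("Created container", "Started container")


def failed_before_start(log_lines: Iterable[str]) -> bool:
    """Return True if a replica was scheduled and then failed
    without any container start marker appearing in between."""
    scheduled = False
    started = False
    for line in log_lines:
        if SCHEDULED in line:
            # A new replica was scheduled; reset the start flag.
            scheduled, started = True, False
        elif any(marker in line for marker in START_MARKERS):
            started = True
        elif FAILED in line:
            if scheduled and not started:
                return True
            scheduled = False
    return False
```

Feeding it the three lines quoted above should flag the failure, while a log that includes the container start lines before the exit should not.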

I do not know if these details matter but here they are

Region - eastus
Container registry credentials - admin credentials
Container registry location - in a separate subscription from container app environment

Steps to reproduce

  1. create a container job
  2. create a service bus queue
  3. Configure the container job to trigger off of the service bus queue
  4. load up a message on the queue
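For step 4, one way to put a message on the queue is the `azure-servicebus` Python SDK. This is only a sketch: the connection string and queue name are placeholders you would supply, the payload fields (`note`, `sent_at`) are made up for illustration, and the original report does not say how the message was actually enqueued.

```python
import json
from datetime import datetime, timezone


def build_test_body(note: str) -> str:
    """Build a small JSON payload; the field names are invented for this sketch."""
    return json.dumps(
        {"note": note, "sent_at": datetime.now(timezone.utc).isoformat()}
    )


def send_test_message(connection_string: str, queue_name: str, note: str = "repro") -> None:
    """Send one message so the event trigger has something to react to."""
    # Imported lazily so build_test_body() works without the SDK installed.
    from azure.servicebus import ServiceBusClient, ServiceBusMessage

    with ServiceBusClient.from_connection_string(connection_string) as client:
        with client.get_queue_sender(queue_name=queue_name) as sender:
            sender.send_messages(ServiceBusMessage(build_test_body(note)))
```

Calling `send_test_message(conn_str, "my-queue")` with your own values should produce a single queued message, after which the job's system logs can be checked for the sequence described above.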

Expected behavior

The container app job is triggered successfully, the image is pulled, and the container starts and executes.

Actual behavior

The pod gets deleted before the image even gets a chance to be pulled.

sg-vintri commented 1 week ago

We are facing this issue too, and our production systems are impacted. It looks like the supporting image for container app jobs was rolled back from version 1.39.6 to version 1.0.8. This is a breaking change for us.

ruvintri commented 1 week ago

Our jobs are triggered from Azure Storage Queues and we are facing a similar issue. As SG stated, the only difference with 'mcr.microsoft.com/k8se/msi-transition:1.0.8-m' is that we do not see "Created container" or "Started container" in the system logs for the container app job.

Curious why the team would have rolled back from msi-transition:1.39.6-m to msi-transition:1.0.8-m.

To mitigate this production issue we have switched the ACA Jobs to manual and created a temporary queue watch/job trigger app.
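For anyone needing a similar stopgap, a watcher like this can be sketched in a few lines of Python. This is not our actual app, just an illustration of the idea: `get_message_count` and `start_job` are hypothetical callables you would implement yourself (for example, `start_job` could shell out to `az containerapp job start`), and the polling interval is an arbitrary default.

```python
import time
from typing import Callable, Optional


def watch_queue(
    get_message_count: Callable[[], int],
    start_job: Callable[[], None],
    polling_interval_s: float = 30.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the queue and start one manual job execution per poll
    whenever messages are waiting. Returns the number of starts."""
    started = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        if get_message_count() > 0:
            start_job()
            started += 1
        polls += 1
        sleep(polling_interval_s)
    return started
```

With `max_polls=None` the loop runs until the process is stopped. Passing a small `max_polls` and a no-op `sleep` makes it easy to exercise the logic locally before pointing it at a real queue.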

chinadragon0515 commented 1 week ago

@sonofhammer @sg-vintri @ruvintri For msi-transition, the tag change is expected. Here are the details: previously this sidecar always had the same tag as the other system components, such as 1.39.6. Like other sidecars, though, it changes much less often than the other system components, so we decided to give it a separate tag. This is the same approach used for the other sidecars you see, such as the envoy-sc sidecar.

Even though the tag changed, the code base is exactly the same.

Did you see this issue before? Can you send your environment information to acasupport at microsoft dot com, along with the exact timestamp when you started seeing the issue, so we can check the logs to see what could be wrong?

ruvintri commented 1 week ago

@chinadragon0515 email sent

chinadragon0515 commented 1 week ago

We confirm there is a regression for jobs with an event trigger and managed identity running on a Consumption v2 environment. We are working on a fix.

chinadragon0515 commented 1 week ago

This is the RCA: the issue was caused by a KEDA version upgrade that introduced a behavior change in the latest deployment. All impacted environments have been mitigated by rolling back to the previous KEDA version.

If you still see the issue, email us the timestamp of the issue and your environment information and we will check. Thanks.

sonofhammer commented 4 days ago

We've confirmed that it has been fixed for us.

Thank you.

jason-berk-k1x commented 4 days ago

yeah, late last week everything broke.....today it's all working:

[Image: Screenshot 2024-06-24 at 10 10 58 AM]

is there any way to "subscribe" to these changes?? breaking my jobs is one thing....not being informed of the changes to the platform that broke me is another....

vinisoto commented 4 days ago

adding a link to the regression announcement: https://github.com/microsoft/azure-container-apps/issues/1211