Container App job loses access to pull image from ACR after a period of time (still)

microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps

Container App job loses access to pull image from ACR after a period of time (still) #1204

Open jason-berk-k1x opened 1 week ago

jason-berk-k1x commented 1 week ago

Please provide us with the following information:

This issue is a: (mark with an x)

[x] bug report -> please search issues before submitting
[ ] documentation issue or request
[x] regression (a behavior that used to work and stopped in a new release)

Issue description

https://github.com/microsoft/azure-container-apps/issues/816#issuecomment-2181193648

Steps to reproduce

no idea, that's part of the issue

jason-berk-k1x commented 1 week ago

Not sure why the original issue was closed a few days ago. I literally ran into this exact issue today (6/20/24):

https://github.com/microsoft/azure-container-apps/issues/816#issuecomment-2181193648

jason-berk-k1x commented 1 week ago

I have since deleted my job and re-ran my Terraform to create a new one. This new job uses the mcr.microsoft.com/k8se/quickstart-jobs:latest image. After the Terraform run, I use the Azure CLI to update the app based on a configuration YAML file. That pipeline ran and everything looks correct in the portal, but when I put a message on the queue that triggers the job, the execution fails and I see this in the system logs:

Screenshot 2024-06-20 at 3 18 56 PM

My job lives in the dev subscription and the ACR is in the global subscription, but I've got many other jobs all built out the same way and none of them have this issue.
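
For anyone trying to reproduce the update step described above, here is a minimal sketch of how the pipeline's CLI call could be assembled. It only builds the argument list for `az containerapp job update` with the `--yaml` option and does not run anything. The job name, resource group, and file name are hypothetical placeholders, not values from this issue, and the real pipeline may pass different options.

```python
import shlex


def build_job_update_command(job_name, resource_group, yaml_path):
    """Build the argv list for updating a Container Apps job from a YAML file.

    All three arguments are placeholders supplied by the caller; nothing here
    is taken from the reporter's actual pipeline.
    """
    return [
        "az", "containerapp", "job", "update",
        "--name", job_name,
        "--resource-group", resource_group,
        "--yaml", yaml_path,
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cmd = build_job_update_command("my-queue-job", "my-dev-rg", "job-config.yaml")
    # Print a shell-safe version of the command instead of executing it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting problems, if you choose to execute it.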

jason-berk-k1x commented 4 days ago

So, all my jobs were working just fine before 6/20, at which point they all started failing. Today, everything is working again. Looking at the change history, it seems Azure updated my environment's KEDA version right before everything broke, then reverted it on Sunday, and today (Monday) everything works:

Screenshot 2024-06-24 at 10 10 58 AM

Not sure why the KEDA version change would cause the issues I'm seeing, but the timing seems very peculiar.

jason-berk-k1x commented 4 days ago

and I found this: https://github.com/microsoft/azure-container-apps/issues/1207#issuecomment-2183792297

We confirm there is a regression for job with event trigger and managed identity running on consumption v2 environment. We are working on fix.