microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

container jobs fail with AssigningReplicaFailed error #1168

Open jason-berk-k1x opened 1 month ago

jason-berk-k1x commented 1 month ago


Issue description

Looking at my execution history, I see a ton of jobs that failed. In an attempt to understand why they failed, I ran this query on my logs:

ContainerAppSystemLogs_CL
| where TimeGenerated >= datetime(2024-01-01)
| where Type_s != 'Normal'                 // only abnormal system events
| where Reason_s != 'KEDAScalerFailed'     // exclude KEDA scaler failure events
| summarize Count = count(), LatestOccurrence = max(TimeGenerated) by Reason_s
| project Reason_s, Count, LatestOccurrence
| order by Count desc

and got back this result:

Screenshot 2024-05-14 at 12 06 15 PM

What is AssigningReplicaFailed? I can't find any documentation explaining why this is happening or what I can do to fix it.
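In the meantime, here's the drill-down I use to pull the raw events behind that summary (a sketch against the same table; only the columns used in the query above are ones I've confirmed in my schema):

ContainerAppSystemLogs_CL
| where Reason_s == 'AssigningReplicaFailed'
| order by TimeGenerated desc
| take 50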

Steps to reproduce

Set up a container app job that is triggered by KEDA when a queue has any messages.

Expected behavior

Some sort of documentation explaining what these error Reasons mean.

Actual behavior

I'm dead in the water, with no idea why I'm getting this error so often.

Screenshots

Screenshot 2024-05-14 at 12 12 00 PM

Screenshot 2024-05-14 at 12 14 06 PM

Screenshot 2024-05-14 at 12 15 06 PM

mm-supernice commented 1 month ago

Same here, but only since enabling NAT GW.

jason-berk-k1x commented 1 month ago

I opened a support ticket... didn't hear anything from anyone for over a week. They finally got back to me on 5/24, and after a 90-minute phone call they weren't able to offer any help or insight into what this error even means. They said they were going to escalate my ticket 🤷

jason-berk-k1x commented 1 week ago

ran the query today just to have something to compare to:

Screenshot 2024-06-17 at 5 26 06 PM

So in 90 days I've gotten 25,361 failures, which is roughly 281 a day.
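If anyone wants to check their own rate, this is roughly the query I used for the per-day breakdown (a sketch; adjust the window and bin size to taste):

ContainerAppSystemLogs_CL
| where TimeGenerated >= ago(90d)
| where Reason_s == 'AssigningReplicaFailed'
| summarize Failures = count() by bin(TimeGenerated, 1d)
| order by TimeGenerated asc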

pwarner commented 1 week ago

We're having the same problem. We're using the D16 workload profile, and about 50% of job executions hit this error, adding over 2 minutes to runs that normally take 20 seconds.

jason-berk-k1x commented 1 week ago

My jobs are configured to time out after 15 minutes, and I poll the Service Bus queue every 10 seconds.

My current theory is that we mostly run into issues when, for whatever reason, there's no capacity on the nodes that Azure manages and we have to wait for them to add more nodes to the pool... then pull the image, which is severely bloated at almost 9 GiB, onto that new hardware... then start running the job. Also, mind you, the 15-minute clock starts ticking when the message hits the queue and KEDA tries to start a job, NOT when the job actually starts running.

I've seen the exact same processing take three minutes when it runs on a node pool that has already pulled the image and has capacity to run a job... then take 12 to 13 minutes when everything is starting cold.
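One way I might sanity-check this theory from the logs: compare how long replicas that hit the error live (first system event to last) against ones that don't. This is a hypothetical sketch; it assumes ContainerAppSystemLogs_CL also exposes a ReplicaName_s column, which you should verify against your own schema:

// hypothetical: ReplicaName_s is assumed, not confirmed in the query above
ContainerAppSystemLogs_CL
| where TimeGenerated >= ago(7d)
| summarize FirstEvent = min(TimeGenerated),
            LastEvent = max(TimeGenerated),
            HitAssignFailure = countif(Reason_s == 'AssigningReplicaFailed') > 0
    by ReplicaName_s
| extend MinutesFirstToLast = datetime_diff('minute', LastEvent, FirstEvent)
| summarize AvgMinutes = avg(MinutesFirstToLast), Replicas = count() by HitAssignFailure

If the cold-start theory holds, the HitAssignFailure == true bucket should show a much larger average.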

mm-supernice commented 1 week ago

The problem is, if you set Min Executions to 0, there is no node to schedule the container on. When the job starts, a node gets initialized, but that takes more time than the timeout. On the next run, however, it works, until after some time the node is scaled back down to 0 and destroyed again.

You can avoid the issue by setting a minimum of at least 1 node, although a minimum of 3 is recommended. This does not happen on the consumption plan because there the jobs run on shared nodes, which are always running.

jason-berk-k1x commented 1 week ago

This does not happen on the consumption plan because there the jobs run on shared nodes, which are always running.

Right, those consumption plan nodes are always running... but they might not have capacity to run your job, and now your job is just waiting for new nodes to be added to that node pool, or for capacity to free up. If new nodes are provisioned, the job will likely need to pull the image fresh, adding more delay.

jason-berk-k1x commented 1 week ago

As an aside, I'm also struggling with the math of the dedicated plans. For example, if I create a new container app environment and define a D4 profile, I can run my job either:

on the consumption plan, with max resources of 4 cores and 8 GiB of RAM; at less than one cent per minute, my 5-minute job will cost me about $0.05

on the D4 profile, where the price depends on how the min/max nodes are configured. If I set the min to zero to emulate the consumption plan, I don't feel like I'm really gaining anything, because I still have to tolerate a cold node start if my jobs are intermittent and bursty (which they are). Even after the cold start, if I'm comparing apples to apples, I'd be using the entire 4 cores of the D4, meaning I could only run one job per node. I understand I could set my max nodes to 25 and have 25 nodes each running a single job, which is an experiment I'm about to go run.

All that said, if I just keep one D4 node running, that would run me about $225/month, and the vast majority of that time NOTHING would be running on that node.
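Back-of-the-envelope with those two figures (just my own arithmetic, nothing official):

$$ \frac{\$225/\text{month}}{\$0.05/\text{job}} = 4500 \text{ jobs/month} \approx 150 \text{ jobs/day} $$

So an always-on D4 only breaks even against the consumption price if I'm running roughly 150 of these 5-minute jobs every day, and that ignores the cold-start penalty entirely.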

I don't see how, outside of the consumption plan, I could get anywhere near 4 cores and 8 GiB per job and still have capacity for 25 jobs (the app env has a soft limit of 100 cores) at a comparable price... especially when

The first 180,000 vCPU-seconds, 360,000 GiB-seconds, and 2 million requests each month are free. Beyond that, you pay for what you use on a per second basis based on the number of vCPU-s and GiB-s your applications are allocated.
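Working that grant through for this job shape (again, just my arithmetic on the quoted numbers): a single 5-minute execution at 4 vCPU / 8 GiB consumes

$$ 4 \times 300 = 1200 \text{ vCPU-s} \qquad\text{and}\qquad 8 \times 300 = 2400 \text{ GiB-s} $$

so the free grant alone covers $\min(180000/1200,\ 360000/2400) = 150$ executions a month before the consumption plan charges anything.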